Scrape Engine

Naive Scrape

To access data from the web, use the naive_scrape interface. The underlying engine is lightweight: it fetches pages with the requests library, parses HTML with bs4, and formats output with trafilatura. trafilatura currently supports the following output formats: json, csv, html, markdown, text, and xml.

from symai.interfaces import Interface

# Instantiate the naive scrape interface and fetch a page.
scraper = Interface("naive_scrape")
url = "https://docs.astral.sh/uv/guides/scripts/#next-steps"
res = scraper(url)
print(res)  # extracted page content
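
Because trafilatura handles the formatting, an output format can in principle be requested per call. A hedged sketch, assuming the interface forwards trafilatura's output_format option under a kwarg of the same name (this kwarg name is an assumption, not a documented parameter):

from symai.interfaces import Interface

scraper = Interface("naive_scrape")
res = scraper(
    "https://docs.astral.sh/uv/guides/scripts/#next-steps",
    output_format="markdown",  # assumed kwarg, forwarded to trafilatura
)
print(res)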

Parallel (Parallel.ai)

The Parallel.ai integration routes scrape calls through the official parallel-web SDK and can handle PDFs, JavaScript-heavy pages, and standard HTML pages in the same workflow. Instantiate the Parallel interface and call .scrape(...) with the target URL; the engine detects scrape requests automatically whenever a URL is supplied.

from symai.extended import Interface

scraper = Interface("parallel")
article = scraper.scrape(
    "https://trafilatura.readthedocs.io/en/latest/crawls.html",
    full_content=True,           # optional: request full document text
    excerpts=True,               # optional: default True, retain excerpt snippets
    objective="Summarize crawl guidance for internal notes."
)
print(str(article))

Configuration requires a Parallel API key and the Parallel model token. Add the following to your settings:

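A minimal sketch of the settings entries, assuming symai's usual symai.config.json naming scheme; the key names below are assumptions, so verify them against your installation:

{
    "SCRAPE_ENGINE_API_KEY": "<your Parallel API key>",
    "SCRAPE_ENGINE_MODEL": "<Parallel model token>"
}
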
When invoked with a URL, the engine calls Parallel's Extract API and returns an ExtractResult. Converting the result to a string joins the excerpts, or the full content when full_content=True. Because processing is offloaded to Parallel's hosted infrastructure, the engine remains reliable on dynamic pages that the naive scraper cannot render. Install the dependency with pip install parallel-web before enabling this engine.
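
For example, to retrieve only the full document text without excerpt snippets, the documented keyword arguments can be combined as in this sketch:

from symai.extended import Interface

scraper = Interface("parallel")
# Disable excerpts and request the full extracted text instead.
article = scraper.scrape(
    "https://trafilatura.readthedocs.io/en/latest/crawls.html",
    full_content=True,
    excerpts=False,
)
print(str(article))  # str() joins the full content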

Firecrawl

Firecrawl.dev specializes in reliable web scraping with automatic handling of JavaScript-rendered content, proxies, and anti-bot mechanisms. It converts web pages into clean formats suitable for LLM consumption and supports advanced features like actions, caching, and location-based scraping.

Examples

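A minimal usage sketch; the interface name "firecrawl" and the plain-call style are assumptions modeled on the other scrape interfaces, and credentials must be configured as described below:

from symai.extended import Interface

scraper = Interface("firecrawl")  # interface name assumed
res = scraper(
    "https://docs.firecrawl.dev/",
    formats=["markdown"],       # documented kwarg: output format
    only_main_content=True,     # documented kwarg: drop navigation and boilerplate
)
print(str(res))
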
Configuration

Enable the engine by configuring Firecrawl credentials:

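A sketch of the expected entry; the exact key name in symai.config.json is an assumption and should be verified against your installation:

{
    "SCRAPE_ENGINE_API_KEY": "<your Firecrawl API key>"
}
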
JSON Schema Extraction

Firecrawl supports structured data extraction using JSON schemas. This is useful for extracting specific fields from web pages using LLM-powered extraction:

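A sketch of schema-based extraction. The shape of the json format object follows Firecrawl's v2 formats parameter; how the symai engine forwards it, and the interface name, are assumptions:

from symai.extended import Interface

scraper = Interface("firecrawl")  # interface name assumed

# JSON schema describing the fields to extract from the page.
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "summary": {"type": "string"},
    },
    "required": ["title"],
}

res = scraper(
    "https://docs.firecrawl.dev/",
    formats=[{"type": "json", "schema": schema}],  # Firecrawl v2 json format
)
print(str(res))
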
Supported Parameters

The engine supports many parameters, passed as kwargs (see the combined sketch after this list). Common ones include:

  • formats: Output formats (["markdown"], ["html"], ["rawHtml"])

  • only_main_content: Extract main content only (boolean)

  • proxy: Proxy mode ("basic", "stealth", "auto")

  • location: Geographic location object with optional country and languages

    • Example: {"country": "US"} or {"country": "RO", "languages": ["ro"]}

  • maxAge: Maximum age of a cached page in milliseconds (integer)

  • storeInCache: Enable caching (boolean)

  • actions: Page interactions before scraping (list of action objects)

    • Example: [{"type": "wait", "milliseconds": 2000}]

    • Example: [{"type": "click", "selector": ".button"}]

    • Example: [{"type": "scroll", "direction": "down", "amount": 500}]

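Several of these parameters can be combined in a single call, as in this sketch (kwarg forwarding is assumed to match the list above; the interface name is an assumption):

from symai.extended import Interface

scraper = Interface("firecrawl")  # interface name assumed
res = scraper(
    "https://example.com/news",
    formats=["markdown"],
    only_main_content=True,
    proxy="auto",                    # let Firecrawl choose the proxy mode
    location={"country": "US"},      # geo-targeted scraping
    actions=[
        {"type": "wait", "milliseconds": 2000},                  # let the page render
        {"type": "scroll", "direction": "down", "amount": 500},  # load lazy content
    ],
)
print(str(res))
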
Check the Firecrawl v2 API documentation for the complete list of available parameters and action types.
