# Scrape Engine
## Naive Scrape
To access data from the web, use the naive_scrape interface. The engine underneath is lightweight: it fetches pages with the requests library, parses HTML with bs4, and formats output with trafilatura. trafilatura currently supports the following output formats: json, csv, html, markdown, text, and xml.
```python
from symai.extended import Interface

scraper = Interface("naive_scrape")
url = "https://docs.astral.sh/uv/guides/scripts/#next-steps"
res = scraper(url)
```
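As a rough illustration of what the engine does under the hood (a minimal sketch using trafilatura directly, not the engine's internal code), the output format can be selected at extraction time:

```python
import trafilatura

# Fetch the raw page; fetch_url returns None on failure.
downloaded = trafilatura.fetch_url("https://docs.astral.sh/uv/guides/scripts/#next-steps")
if downloaded is not None:
    # output_format accepts the formats listed above, e.g. "markdown".
    print(trafilatura.extract(downloaded, output_format="markdown"))
```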
## Parallel (Parallel.ai)

The Parallel.ai integration routes scrape calls through the official parallel-web SDK and can handle PDFs, JavaScript-heavy pages, and standard HTML in the same workflow. Instantiate the Parallel interface and call .scrape(...) with the target URL; the engine detects scrape requests automatically whenever a URL is supplied.
```python
from symai.extended import Interface

scraper = Interface("parallel")
article = scraper.scrape(
    "https://trafilatura.readthedocs.io/en/latest/crawls.html",
    full_content=True,  # optional: request full document text
    excerpts=True,      # optional: default True, retain excerpt snippets
    objective="Summarize crawl guidance for internal notes.",
)
print(str(article))
```

Configuration requires a Parallel API key and the search-engine model set to parallel. Add the following to your settings:
```json
{
    …
    "SEARCH_ENGINE_API_KEY": "…",
    "SEARCH_ENGINE_MODEL": "parallel"
    …
}
```

When invoked with a URL, the engine calls Parallel's Extract API and returns an ExtractResult. The result string joins the excerpts, or the full content if requested. Because processing is offloaded to Parallel's hosted infrastructure, the engine remains reliable on dynamic pages that the naive scraper cannot render. Install the dependency with pip install parallel-web before enabling this engine.
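Since the engine handles PDFs as well, the same call shape applies to document URLs. A minimal sketch, assuming a publicly reachable PDF (the URL below is a placeholder, not a real endpoint):

```python
from symai.extended import Interface

scraper = Interface("parallel")
# Placeholder URL: substitute any publicly reachable PDF.
report = scraper.scrape(
    "https://example.com/reports/annual.pdf",
    full_content=True,  # request the full extracted text of the document
)
print(str(report))
```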