Neuro-Symbolic Engine

The neuro-symbolic engine is our generic wrapper around large language models (LLMs); it handles prompts, function/tool calls, vision tokens, token counting/truncation, and more. Depending on which backend you configure (OpenAI/GPT, Claude, Gemini, Deepseek, Groq, Cerebras, llama.cpp, HuggingFace, …), a few things must be handled differently:

  • GPT-family (OpenAI) and most backends accept the usual max_tokens, temperature, etc., out of the box.

  • Claude (Anthropic), Gemini (Google), Deepseek, Cerebras, and Qwen (Groq) can return an internal "thinking trace" when you enable it.

  • Local engines (llama.cpp, HuggingFace) do not yet support token counting, JSON format enforcement, or vision inputs in the same way.

  • Groq engine requires a special format for the NEUROSYMBOLIC_ENGINE_MODEL key: groq:model_id. E.g., groq:qwen/qwen3-32b.

  • Cerebras engine requires a special format for the NEUROSYMBOLIC_ENGINE_MODEL key: cerebras:model_id. E.g., cerebras:gpt-oss-120b.

  • OpenAI Responses API engine requires the responses: prefix: responses:model_id. E.g., responses:gpt-4.1, responses:o3-mini. This uses OpenAI's newer /v1/responses endpoint instead of /v1/chat/completions.

  • Token truncation and streaming are handled automatically, but behavior may vary by engine.

❗️NOTE❗️ The most accurate documentation is the code itself, so be sure to check out the tests. Look for the mandatory marker, since those are the features that are tested and guaranteed to work.


Basic Query

from symai import Symbol, Expression

# A one-off question
res = Symbol("Hello, world!").query("Translate to German.")
print(res.value)
# → "Hallo, Welt!"

Under the hood this uses the neuro-symbolic engine.


Raw LLM Response

If you need the raw LLM objects (e.g. openai.ChatCompletion, anthropic.types.Message/anthropic.Stream, or google.genai.types.GenerateContentResponse), use raw_output=True:
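A minimal sketch, assuming the raw_output=True kwarg is forwarded through query and the call then returns the backend's native response object rather than a Symbol:

from symai import Symbol

sym = Symbol("Hello, world!")

# raw_output=True returns the backend's native response object instead of a Symbol.
raw = sym.query("Translate to German.", raw_output=True)

# The exact type depends on the configured engine (OpenAI, Anthropic, Google, ...).
print(type(raw))
print(raw)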


Function/Tool Calls

Models that support function calls (OpenAI GPT-4, Claude, Gemini, …) can dispatch to your symai.components.Function definitions:
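A hedged sketch of passing an OpenAI-style tool schema through the query call; the tools kwarg is assumed to be forwarded verbatim to the backend, and the tool name and schema below are purely illustrative. Check the tests for the exact, supported shape:

from symai import Symbol

# Hypothetical tool schema in the OpenAI function-calling format (an assumption;
# your backend may expect a different shape).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# raw_output=True exposes the backend response so the tool call can be inspected.
raw = Symbol("What is the weather in Berlin?").query(
    "Answer the user, calling a tool if needed.",
    tools=tools,
    raw_output=True,
)
print(raw)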

For Claude the API shapes differ slightly:
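For reference, Anthropic's API describes tools with an input_schema (rather than parameters) and returns tool calls as tool_use content blocks on the raw message. A hedged sketch, assuming the tools kwarg is passed through unchanged:

from symai import Symbol

raw = Symbol("What is the weather in Berlin?").query(
    "Answer the user, calling a tool if needed.",
    tools=[{
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }],
    raw_output=True,
)

# Tool calls come back as "tool_use" content blocks on the raw Anthropic message.
for block in raw.content:
    if block.type == "tool_use":
        print(block.name, block.input)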


Thinking Trace (Claude, Gemini, Deepseek, Groq, Cerebras, OpenAI Responses)

Some engines (Anthropic's Claude, Google's Gemini, Deepseek, Groq, Cerebras, and the OpenAI Responses API with reasoning models) can return an internal thinking trace that shows how they arrived at an answer. To get it, you must:

  1. Pass return_metadata=True.

  2. Pass a thinking= configuration if required.

  3. Inspect metadata["thinking"] after the call.

Claude (Anthropic)
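A hedged sketch, assuming the thinking kwarg is forwarded to Anthropic's extended-thinking parameter and that return_metadata=True makes the call return a (result, metadata) pair:

from symai import Symbol

res, metadata = Symbol("Is 17077 a prime number?").query(
    "Answer with a short explanation.",
    return_metadata=True,
    # Anthropic's extended-thinking configuration; assumed to be forwarded as-is.
    thinking={"type": "enabled", "budget_tokens": 1024},
)
print(res.value)               # the final answer
print(metadata["thinking"])    # the internal thinking trace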

Gemini (Google)

Deepseek

Groq

Cerebras

For Cerebras backends, symai collects the reasoning trace from either the dedicated reasoning field on the message (when present) or from <think>…</think> blocks embedded in the content. In both cases the trace is exposed as metadata["thinking"] and removed from the final user-facing text.
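A minimal sketch under the assumptions above (NEUROSYMBOLIC_ENGINE_MODEL set to e.g. cerebras:gpt-oss-120b, and return_metadata=True returning a (result, metadata) pair):

from symai import Symbol

res, metadata = Symbol("Is 17077 a prime number?").query(
    "Answer with a short explanation.",
    return_metadata=True,
)
print(res.value)               # <think>…</think> blocks are stripped from this text
print(metadata["thinking"])    # the extracted reasoning trace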

OpenAI Responses API (reasoning models)

For OpenAI Responses API with reasoning models (e.g., o3-mini, o3, o4-mini, gpt-5, gpt-5.1), the thinking trace is extracted from the reasoning summary in the response output.
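A hedged sketch for the Responses backend, assuming a reasoning model is configured via the responses: prefix and that the reasoning summary surfaces as metadata["thinking"]:

from symai import Symbol

# Assumes NEUROSYMBOLIC_ENGINE_MODEL is set to e.g. "responses:o3-mini".
res, metadata = Symbol("Is 17077 a prime number?").query(
    "Answer with a short explanation.",
    return_metadata=True,
)
print(res.value)
print(metadata["thinking"])    # extracted from the reasoning summary, if available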


JSON-Only Responses

To force the model to return valid JSON and have symai validate it:
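A hedged sketch, assuming an OpenAI-style response_format kwarg is forwarded to the backend; the exact validation hook symai applies is best confirmed in the tests:

import json
from symai import Symbol

res = Symbol("Berlin, Paris, Madrid").query(
    "Return a JSON object mapping each city to its country.",
    # Assumption: forwarded verbatim to the backend's JSON mode.
    response_format={"type": "json_object"},
)
data = json.loads(res.value)
print(data)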

Groq JSON mode caveat

Groq currently has a quirk (arguably a bug) when combining JSON Object Mode with an explicit tool_choice: "none". Their API may return:

This happens because JSON Object Mode internally invokes a JSON “constrainer” tool (<|constrain|>json), which collides with tool_choice: "none". Groq’s own docs state that JSON modes can’t be mixed with tool use; here, the model implicitly “uses a tool” even when you didn’t ask for one. As of Aug. 22, 2025, this behavior is triggered regardless of whether you set tool_choice: "auto" or tool_choice: "none", and prompting the model not to choose that tool doesn’t work either.


Token Counting & Truncation

The default pipeline automatically estimates token usage and truncates the conversation as needed. On GPT-family backends, the raw API usage reported in response.usage matches what symai computes. For Gemini, an API call is made to retrieve token counts. For Claude, llama.cpp, and HuggingFace, token-comparison tests are skipped, as token counting is not uniformly supported yet.

If a tokenizer is available for the current engine, you can easily count tokens in a string via Symbol:
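A minimal sketch, assuming Symbol exposes the engine tokenizer's output via a tokens property (check the tests for the exact attribute):

from symai import Symbol

sym = Symbol("The quick brown fox jumps over the lazy dog.")

# Assumption: `tokens` returns the tokenized representation of the symbol's text.
print(len(sym.tokens))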

Tracking Usage and Estimating Costs with MetadataTracker

For more detailed tracking of API calls, token usage, and estimating costs, you can use the MetadataTracker in conjunction with RuntimeInfo. This is particularly useful for monitoring multiple calls within a specific code block.

❗️NOTE❗️ We only track OpenAI models for now (chat and search).

MetadataTracker collects metadata from engine calls made within its context. RuntimeInfo then processes this raw metadata to provide a summary of token counts, number of API calls, elapsed time, and an estimated cost if pricing information is provided.

Here's an example of how to use them:
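A heavily hedged sketch: the import paths, the RuntimeInfo helper used to summarize the tracker, and the pricing dictionary below are all assumptions for illustration, not the canonical API. Check the tests for the supported usage:

from symai import Symbol
from symai.components import MetadataTracker        # import path is an assumption
from symai.collect.stats import RuntimeInfo         # import path is an assumption

# Hypothetical per-1M-token pricing for the configured model; replace with current rates.
pricing = {"gpt-4o": {"input": 2.50, "output": 10.00}}

with MetadataTracker() as tracker:
    Symbol("Hello, world!").query("Translate to German.")
    Symbol("Goodbye, world!").query("Translate to French.")

# Summarize tracked calls: token counts, number of API calls, elapsed time, estimated cost.
info = RuntimeInfo.from_tracker(tracker)             # helper name is an assumption
print(info)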

This approach provides a robust way to monitor and control costs associated with LLM API usage, especially when making multiple calls. Remember to update the pricing dictionary with the current rates for the models you are using. The estimate_cost function can also be customized to reflect complex pricing schemes (e.g., different rates for different models, image tokens, etc.).

The extras Field

RuntimeInfo includes an extras dictionary for engine-specific usage metrics that don't fit the standard token fields. For example, when using ParallelEngine, the tracker captures additional usage items like sku_search (number of search operations) and sku_extract_excerpts (number of excerpt extractions):

When aggregating RuntimeInfo objects with +, numeric values in extras are summed, while non-numeric values are overwritten.


Preview Mode

When building new Function objects, preview mode lets you inspect the prepared prompt before calling the engine:
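A hedged sketch, assuming the call accepts a preview=True flag that returns the prepared prompt without contacting the engine:

from symai.components import Function

fn = Function("Summarize the input text in one sentence.")

# Assumption: preview=True short-circuits the engine call and returns the prepared prompt.
prep = fn("Large language models are general-purpose sequence predictors.", preview=True)
print(prep)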


Self-Prompting

If you want the model to “self-prompt” (i.e. use the original symbol text as context):
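A hedged sketch, assuming a self_prompt=True flag on the query call that reuses the symbol's own text as context:

from symai import Symbol

sym = Symbol("The Eiffel Tower is located in Paris and was completed in 1889.")

# Assumption: self_prompt=True injects the symbol's original text as the prompt context.
res = sym.query("What year was it completed?", self_prompt=True)
print(res.value)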


Vision Inputs

Vision tokens (e.g. <<vision:path/to/cat.jpg:>>) can be passed in prompts on supported backends:
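A minimal sketch, assuming a vision-capable backend (e.g. GPT-4o, Claude, Gemini) and that the vision tag can be embedded directly in the symbol's text; path/to/cat.jpg is a placeholder path:

from symai import Symbol

# The <<vision:...:>> tag is replaced with the image content on supported backends.
sym = Symbol("<<vision:path/to/cat.jpg:>>")
res = sym.query("Describe what you see in the image.")
print(res.value)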
