Description
🔴 Required Information
Is your feature request related to a specific problem?
Yes, this is related to a bug in both the ADK framework and plugin architecture that leads to corrupted OpenTelemetry span traces and incorrect plugin state when an LLM hallucinates a tool.
When the LLM suggests a tool that does not exist, src/google/adk/flows/llm_flows/functions.py handles the resulting ValueError by skipping before_tool_callback entirely and jumping straight to on_tool_error_callback. This error path also runs outside the standard tracer.start_as_current_span context.
Because BigQueryAgentAnalyticsPlugin assumes standard balanced lifecycle hooks (a call to before_tool_callback matched with after_tool_callback or on_tool_error_callback), it blindly pops an item off its internal TraceManager stack during on_tool_error_callback via TraceManager.pop_span(). Since before_tool_callback never fired to push the tool's span onto the stack, the plugin inadvertently pops the parent's span (usually the Agent's span) and calls .end() on it prematurely. This corrupts the observability trace stack and records the error against the agent's span directly instead of the tool's span.
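The imbalance can be reproduced without ADK at all. The following is a minimal sketch, not the plugin's actual code: NaiveTraceManager and simulate_hallucinated_tool are hypothetical stand-ins, and spans are plain strings rather than OpenTelemetry spans.

```python
class NaiveTraceManager:
    """Hypothetical stand-in for a span stack that pops without checking."""

    def __init__(self):
        self.stack = []
        self.ended = []

    def push_span(self, name):
        self.stack.append(name)

    def pop_span(self):
        # Blind pop: whatever is on top is treated as "the tool's span".
        span = self.stack.pop()
        self.ended.append(span)  # stands in for span.end()
        return span


def simulate_hallucinated_tool():
    tm = NaiveTraceManager()
    tm.push_span("agent")  # the agent's span is opened as usual
    # before_tool_callback never fires for the unknown tool,
    # so no "tool" span is pushed here.
    popped = tm.pop_span()  # on_tool_error_callback pops anyway
    return popped, tm.stack, tm.ended
```

In this sketch the error callback ends the "agent" entry and leaves the stack empty, which mirrors the premature .end() on the parent span described above.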
Describe the Solution You'd Like
- Framework Fix: The ADK runner should invoke before_tool_callback with the dummy/uninitialized BaseTool (which it currently creates for the error callback anyway), run the OTel context manager, and then invoke on_tool_error_callback, ensuring that the lifecycle is balanced.
- Plugin Fix: The BigQueryAgentAnalyticsPlugin.TraceManager should be more resilient. push_span() and pop_span() should ideally store or validate the span_type (e.g., agent vs. tool) or confirm that the popped span actually belongs to the tool that errored, rather than blindly popping off the stack.
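One possible shape for the plugin-side guard is sketched below. It is an illustration under assumptions, not the plugin's real API: SpanRecord, CheckedTraceManager, and the expected_type / expected_name parameters are hypothetical names, and the real TraceManager would hold OpenTelemetry spans rather than these records.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SpanRecord:
    span_type: str  # e.g. "agent" or "tool"
    name: str
    ended: bool = False


@dataclass
class CheckedTraceManager:
    """Hypothetical span stack that validates before popping."""

    _stack: List[SpanRecord] = field(default_factory=list)

    def push_span(self, span_type: str, name: str) -> SpanRecord:
        record = SpanRecord(span_type, name)
        self._stack.append(record)
        return record

    def pop_span(
        self, expected_type: str, expected_name: Optional[str] = None
    ) -> Optional[SpanRecord]:
        # Refuse to pop when the top of the stack is not the span the
        # caller believes it owns; the parent span stays open.
        if not self._stack:
            return None
        top = self._stack[-1]
        if top.span_type != expected_type:
            return None
        if expected_name is not None and top.name != expected_name:
            return None
        self._stack.pop()
        top.ended = True  # stands in for span.end()
        return top
```

With this guard, an on_tool_error_callback that arrives without a matching push gets None back and can log the error without ending the agent's span.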
Impact on your work
This corruption cascades through the observability trace hierarchy whenever hallucinated tool calls occur. For instance, the TOOL_ERROR logs appear with the Agent's span_id, and subsequent agent steps may log under the wrong parent span ID. This breaks our observability pipelines and makes root cause analysis using tool_events_view highly convoluted.
Proposed API / Implementation
In src/google/adk/flows/llm_flows/functions.py, rather than catching the ValueError and bypassing the main logic flow, the error should be handled identically to standard tool runtime exceptions inside the _run_with_trace logic. That way the before_tool_callback is executed consistently.
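A rough sketch of the balanced ordering follows. It does not reproduce the real functions.py or _run_with_trace: call_tool, tool_span, and RecordingPlugin are hypothetical, the callbacks take simplified arguments, and a plain context manager stands in for tracer.start_as_current_span.

```python
from contextlib import contextmanager


class RecordingPlugin:
    """Hypothetical plugin that records the order of lifecycle events."""

    def __init__(self):
        self.events = []

    def before_tool_callback(self, tool_name):
        self.events.append(f"before:{tool_name}")

    def after_tool_callback(self, tool_name, result):
        self.events.append(f"after:{tool_name}")

    def on_tool_error_callback(self, tool_name, error):
        self.events.append(f"error:{tool_name}")


@contextmanager
def tool_span(events, name):
    # Stand-in for the OTel span context manager.
    events.append(f"span_start:{name}")
    try:
        yield
    finally:
        events.append(f"span_end:{name}")


def call_tool(tools, name, args, plugin):
    # before_tool_callback always fires, even for an unknown tool.
    plugin.before_tool_callback(name)
    with tool_span(plugin.events, name):
        try:
            if name not in tools:
                # The lookup failure is raised inside the traced path.
                raise ValueError(f"Tool {name!r} not found")
            result = tools[name](**args)
        except Exception as exc:
            plugin.on_tool_error_callback(name, exc)
            return None
        plugin.after_tool_callback(name, result)
        return result
```

Under these assumptions, a hallucinated tool name produces the same before / span / error sequence as a tool that raises at runtime, so plugins that pair their hooks stay balanced.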