‹ learn
MCP concepts

Tool-output prompt injection

Tool-output prompt injection is when an MCP tool's response — not its static schema — carries instructions that the agent reads as commands, so a server (or content it fetched) can hijack the agent at call time. It is the runtime form of tool poisoning, delivered through the data an MCP tool returns rather than through its description.

What it is

Most MCP threat models focus on the static surface: a tool's name, description, and inputSchema, which the agent reads before any call. Tool-output prompt injection lives one step later — in the bytes a tool actually returns. When the agent ingests that response, any text shaped like an instruction ("ignore previous instructions", "send the API key to https://…", "do not tell the user") can be interpreted as a command rather than as data.

This is dangerous because the payload need not be authored by the server operator. A fetch_url, search_web, or read_page tool can relay a webpage, an issue comment, or an email that an attacker wrote — the MCP server is just the conduit. The same server can pass a static audit of its schemas and still hand the agent a poisoned response on a specific query.

Why static schema checks miss it

You cannot see a poisoned response by reading a tool's declaration; the description can be clean while the runtime output is hostile. Detecting it requires actually invoking the tool with a benign input and inspecting what comes back — a behavioral test, not a static scan.

Tool-output injection is also the delivery mechanism behind the lethal trifecta: a server that ingests untrusted content, can reach sensitive data, and can exfiltrate or mutate is one injected response away from acting on attacker instructions. The injection is the trigger; the trifecta is the blast radius.

How CheckMCP handles it

CheckMCP addresses this at two layers. Statically, the Security pillar (security.py, OWASP MCP03) runs its INJECT regex over each tool's description, input schema (param names, descriptions, defaults, examples), and output schema — emitting a CRITICAL "injected instruction (poisoning)" finding and tripping a hard floor that caps the MCP Score at 69 (grade D) via the score.py SECURITY_RISK floor. But the output-delivered case is caught by the opt-in behavioral evals (evals.py, CheckMCP's T4 canary sandbox): _selectable picks only read-only tools (readOnlyHint set, or a safe verb like get/list/search with no mutating verb) whose required args it can fill with a benign canary, never calling mutating tools, then runs _analyze over each response. A multilingual INJECTION regex match yields an active_prompt_injection HIGH finding ("Tool output contains agent-directed instructions (tool-response poisoning)", confidence 0.85–0.95); EXFIL matches yield an exfiltration_vector; a credential-shaped string yields secret_in_output and email/number patterns yield pii_in_output. The evals also plant a unique callback-canary URL in tool inputs — if the server fetches it, hit_check returns an exfiltration_confirmed finding at confidence 1.0 (confirmed SSRF/exfiltration). Any HIGH finding makes the behavioral verdict "malicious". Separately, the Security pillar's lethal-trifecta check (MCP06) flags servers whose capability mix (untrusted-content ingestion plus sensitive-data access plus exfiltration or destruction) would let an injected response exfiltrate.

Tool-output prompt injection — FAQ

How is tool-output prompt injection different from tool poisoning?+
Tool poisoning hides instructions in a tool's static description or schema, visible before any call. Tool-output prompt injection delivers the payload in the tool's runtime response, so it can only be caught by actually invoking the tool. CheckMCP detects the static form via the OWASP MCP03 INJECT regex in security.py and the runtime form via the behavioral evals in evals.py.
Can a clean-looking MCP server still deliver an injection?+
Yes. A tool that fetches or relays external content (web pages, emails, issue comments) can pass a static schema audit and still return attacker-authored text on a specific query. That is why CheckMCP's behavioral evals invoke read-only tools with a canary input and inspect the actual response rather than trusting the declaration.
Does CheckMCP call tools that could change my data?+
No. The behavioral sandbox only exercises tools it judges read-only-safe (readOnlyHint set, or a safe verb like get/list/search with no mutating verb), and skips any tool whose required arguments it cannot fill with a benign canary. It never invokes tools with mutating verbs such as create, delete, send, or execute.
How does CheckMCP confirm exfiltration rather than just suspect it?+
Beyond regex pattern matching, the evals plant a unique callback-canary URL inside tool inputs. If the server fetches that URL, the hit_check callback fires and CheckMCP records an exfiltration_confirmed finding at confidence 1.0 — concrete proof the server makes outbound calls on caller-supplied data (confirmed SSRF/exfiltration).

Related