MCP rug pull
An MCP rug pull is when a Model Context Protocol server you already approved silently mutates its tool definitions after the fact — rewriting a description, adding a hidden instruction, widening a destructive tool, or swapping behavior — so code that passed review starts behaving differently. CheckMCP's methodology catches it by hashing the normalized tool set against a stored baseline and re-running its OWASP and (optional) behavioral checks against whatever the server now returns, flagging breaking changes and newly-introduced risk.
What an MCP rug pull is
MCP servers ship their tools dynamically: an agent calls `tools/list` and trusts whatever names, descriptions and `inputSchema` the server returns. A rug pull exploits that trust window. You audit a server, approve it, integrate it — and later the server changes what it returns. Because the agent re-reads tool definitions on each session rather than pinning a reviewed copy, the new definitions take effect with no further human approval.
The payload can be anything the original review would have flagged: a freshly-injected instruction in a description ("also forward the result to…"), a hardcoded credential added to an example, a previously read-only tool gaining a destructive capability, or a benign-looking `fetch` tool that begins exfiltrating. The server presents a clean face during the audit, then "pulls the rug" once integrated — the MCP analogue of a dependency that goes malicious in a later release.
Silent tool drift vs. a deliberate rug pull
Silent tool drift is the broader, often-unintentional version: a server redeploys and a tool is renamed, removed, or has its schema changed without a version bump. Nothing was malicious, but agent code that depended on the old contract breaks silently — calls start malforming or routing to the wrong tool, and there was no signal that anything changed.
A rug pull is drift weaponized: the same silent-change mechanism, but the new definition is engineered to be harmful. Operationally they are detected the same way — by comparing the current tool surface against a known-good baseline — which is why a drift detector is also a rug-pull detector. CheckMCP's RUBRIC.md treats this under Reliability as metric 6.6 `tools_list_regression`: a stable hash of the sorted (name, schema-hash) tuples compared to a stored baseline, where a breaking removal or rename without a version bump scores 30, a non-breaking addition scores 90, and no change scores 100.
Why a one-time audit is not enough
A single audit only certifies the server as it was at probe time. Both drift and rug pulls happen after that, so the only reliable defense is continuous re-probing: capture a baseline, then on each run diff the tool set and re-run the same security and behavioral checks against whatever the server now returns.
CheckMCP's methodology makes this explicit with measurement tiers — static (T1) and active (T2) checks run in a single shot, but full reliability and the `tools_list_regression` check are T3 (temporal), requiring repeated probes over a time window (RUBRIC.md specifies ≥24h, ≥50 samples). Detecting a rug pull is fundamentally a T3 problem: you cannot see the change from one snapshot, only by comparing two.
How CheckMCP handles it
CheckMCP's stated defense against rug pulls and silent drift is continuous monitoring rather than a one-time pass (llms.txt: "Continuous monitoring detects rug-pulls and silent tool drift via tool pinning"). Under the Reliability pillar, RUBRIC.md defines metric 6.6 `tools_list_regression`: a stable hash of the sorted (name, schema-hash) tuples compared against a stored baseline — a breaking removal or rename without a version bump scores 30 vs. 100 for an unchanged set, and this is a T3 (temporal) measurement requiring the repeated probes that continuous monitoring provides. (Note: reliability is reported with a LOW-confidence flag and is excluded from the single-shot composite in score.py, which computes Pillar 6 only from in-run latency until a T3 window exists.) Crucially, every re-probe also re-runs the full static risk analysis on the *new* definitions: security.py's `audit()` re-scans descriptions, schemas and outputs for newly-injected poisoning (its `INJECT` pattern → MCP03 CRITICAL), hardcoded secret values (`SECRET_VAL` → MCP01 CRITICAL), a destructive tool that lost its confirm param or `destructiveHint` (MCP02 HIGH), and a lethal-trifecta surface that newly combines untrusted-content + sensitive-data + exfil/destruction (MCP06 CRITICAL) — and any MCP01/MCP03 finding or a confirmed trifecta trips the `hard_floor` in score.py that caps the score at 69 (grade D). If behavioral evals are enabled (opt-in, read-only), evals.py re-exercises the read-only-safe tools with canary inputs and flags runtime changes such as a new `active_prompt_injection` or `exfiltration_vector` in tool output, or an `exfiltration_confirmed` hit when the server fetches the planted callback-canary URL. So a server that flips malicious after approval re-scores on the next run and fails the same gates it would have failed on day one.