Where does this data come from?

We took the most-used MCP servers listed on the Smithery registry (ranked by Smithery's useCount), connected to each through Smithery's hosted gateway with our own MCP scanner, and aggregated the results. We keep one scan per endpoint - the most recent - and compute statistics across those distinct servers. No individual server is named anywhere on this page; the report is deliberately aggregate and anonymous.
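A minimal sketch of the "one scan per endpoint, most recent wins" rule, assuming each scan is a record with an endpoint and a scan date. The field names (`endpoint`, `scanned_on`, `score`) and the example URLs are hypothetical, not the scanner's real schema.

```python
from datetime import date


def latest_scan_per_endpoint(scans):
    """Keep only the most recent scan for each distinct endpoint."""
    latest = {}
    for scan in scans:
        current = latest.get(scan["endpoint"])
        if current is None or scan["scanned_on"] > current["scanned_on"]:
            latest[scan["endpoint"]] = scan
    return list(latest.values())


# Illustrative records only; the endpoints and scores are made up.
scans = [
    {"endpoint": "https://a.example/mcp", "scanned_on": date(2026, 6, 18), "score": 70},
    {"endpoint": "https://a.example/mcp", "scanned_on": date(2026, 7, 1), "score": 82},
    {"endpoint": "https://b.example/mcp", "scanned_on": date(2026, 6, 20), "score": 91},
]
distinct = latest_scan_per_endpoint(scans)
```

Statistics are then computed over `distinct`, so a server that was scanned several times still counts once.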

Why can't every server be scored - what does 'couldn't connect' mean?

Our scanner connects over remote Streamable HTTP and reads what a server advertises. A server that requires its own credential (an API key for the underlying service) refuses the unauthenticated handshake, so we can't see its tools. Those servers count toward the connectability finding - 'can an agent use this without a per-server credential?' - but are excluded from the score, tool-count, and quality figures entirely. Folding their unscored 0s into the averages would be misleading, so we don't.
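To see why folding unscored servers into the averages would mislead, here is a small sketch with made-up results. The `outcome` labels and the `score` field are illustrative assumptions; a score of `None` stands for a server we could not introspect.

```python
from statistics import mean

# Hypothetical results: two introspected servers, two we could not see into.
results = [
    {"outcome": "connected", "score": 90},
    {"outcome": "connected", "score": 70},
    {"outcome": "auth_required", "score": None},
    {"outcome": "failed", "score": None},
]


def connect_rate(results):
    """Every server counts toward connectability."""
    connected = sum(1 for r in results if r["outcome"] == "connected")
    return connected / len(results)


def mean_score(results):
    """Only introspected servers count toward the score."""
    scored = [r["score"] for r in results if r["score"] is not None]
    return mean(scored) if scored else None


def naive_mean_score(results):
    """The misleading variant: treat every unscored server as a 0."""
    return mean(r["score"] if r["score"] is not None else 0 for r in results)
```

With these four records the connect rate is 0.5, the mean over introspected servers is 80, and the naive mean is 40, which would punish the introspected servers for failures that say nothing about their quality.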

Is this representative of all MCP servers?

No. The sample is the most popular servers that Smithery hosts with a remote endpoint, which skews toward actively maintained, deployment-ready servers. Servers distributed only as a local stdio/npm package (the bulk of every registry) have no remote endpoint to scan and aren't included, nor are servers from other registries. Read the numbers as 'among the most popular remotely-reachable MCP servers', not as a census of the ecosystem.

How is the MCP score calculated?

Each scan runs the M1-M13 checks - handshake, server metadata, tool/parameter descriptions, output schemas, annotations, naming, resources, prompts, capability honesty, MCP Apps, and (informational) error-output modeling - weighted toward tool quality, with checks that don't apply to a given server excluded. The full weighting and the pass/warn/fail rules are on the methodology page. Output schemas and annotations are graded as best-practice adoption, not protocol compliance, and an output schema only counts if it declares specific fields - a bare {"type":"object"} doesn't.
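The "specific fields" rule can be sketched roughly as follows. This is an approximation under stated assumptions, not the scanner's actual code: it treats a schema as specific when it declares at least one named property or points at a definition through `$ref`, and the function name is hypothetical.

```python
def declares_specific_fields(schema):
    """Return True if an output schema says something concrete about its shape.

    Assumption: "specific" means a non-empty `properties` map or a `$ref`.
    """
    if not isinstance(schema, dict):
        return False
    if "$ref" in schema:
        return True
    properties = schema.get("properties")
    return isinstance(properties, dict) and len(properties) > 0


bare = {"type": "object"}
specific = {
    "type": "object",
    "properties": {"id": {"type": "string"}, "total": {"type": "number"}},
}
```

Here `declares_specific_fields(bare)` is false and `declares_specific_fields(specific)` is true: the first schema tells an agent nothing about what comes back, the second names the fields it can rely on.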

Do MCP tools describe what happens when they fail?

Mostly no - and it's the gap hiding under the output-schema number. Of the tools that do declare an output schema, almost all model only the success case; the error path comes back as unstructured text. That means an agent can't cleanly tell 'the tool failed, retry or back off' from 'the tool succeeded with an empty result', which is another place where tool chains silently break. We surface this as an informational signal (M13): the share of schema-bearing tools whose schema admits a failure path (an error/status field or a union variant). It doesn't affect the score. We can't observe the runtime isError flag because the scan is read-only and never calls a tool, so this measures what the schema itself declares.
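As an illustration of what "admits a failure path" could mean in practice, here is a heuristic sketch. The set of field names and the way union variants are inspected are assumptions made for this example; the real M13 rule is defined on the methodology page.

```python
# Assumed field names that signal a modeled failure; the real list may differ.
ERROR_FIELD_NAMES = {"error", "errors", "status"}


def admits_failure_path(schema):
    """Heuristic: does the output schema describe anything besides success?"""
    if not isinstance(schema, dict):
        return False
    properties = schema.get("properties")
    if isinstance(properties, dict) and ERROR_FIELD_NAMES & set(properties):
        return True
    # A union counts if at least one of its variants models a failure.
    for keyword in ("oneOf", "anyOf"):
        variants = schema.get(keyword)
        if isinstance(variants, list) and any(admits_failure_path(v) for v in variants):
            return True
    return False


success_only = {"type": "object", "properties": {"rows": {"type": "array"}}}
with_union = {
    "oneOf": [
        {"type": "object", "properties": {"rows": {"type": "array"}}},
        {"type": "object", "properties": {"error": {"type": "string"}}},
    ]
}
```

`success_only` fails the check and `with_union` passes it. Because the check only reads the declared schema, it says nothing about how the tool behaves at runtime.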

Did the methodology change after launch?

Yes - and openly. After the first cut went out on r/mcp, MCP builders pushed back and three changes shipped from the thread: the output-schema check was tightened to require specific named fields rather than a bare object (dropping that figure from 29% to 22%), tool annotations were reweighted above output schemas because missing readOnlyHint/destructiveHint annotations are a safety gap, and a new informational check was added for whether a tool models its error path (only about 5% do). The numbers here reflect the updated methodology. The guiding idea from that discussion: a good schema is 'predictable before it runs' - specific enough that a model or a human can tell what a tool returns without calling it.

How often do these numbers change?

The page recomputes from the corpus on a daily cycle. The 'as of' date at the top reflects the latest scan included; cite the date alongside any statistic for a fixed reference. Where the introspected sample is small, figures lead with raw counts and carry a caveat.

State of MCP Servers

Can an agent actually connect?

The first question isn’t quality - it’s reach. Of 273 popular servers we tried, 85% accepted an unauthenticated handshake and let us read their tools. 8% refused without a per-server credential, and the rest failed to connect. Only the 231 we could introspect feed the quality figures below - servers we couldn’t see into are never scored as if they were bad.

Outcome	Servers	Share
Connect without a per-server credential	231	85%
Refused without a per-server credential	22	8%
Failed to connect (transport / non-MCP)	20	7%
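The shares in the table follow directly from the counts. A quick check, using the three counts above (the dictionary keys are just labels chosen for this example):

```python
# Counts taken from the connectability table.
outcomes = {
    "connected": 231,
    "refused_without_credential": 22,
    "failed_to_connect": 20,
}

total = sum(outcomes.values())
shares = {name: round(100 * count / total) for name, count in outcomes.items()}

for name, share in shares.items():
    print(f"{name}: {outcomes[name]} of {total} ({share}%)")
```

The total is 273 and the rounded shares are 85%, 8%, and 7%.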

How well-built is the average server?

Across the 231 servers we could introspect, the mean and median MCP score are both 81 out of 100. The typical server advertises 16 tools, 2 resources, and 1 prompt. The score is weighted toward tool quality; see the methodology for the formula and rating bands.

Rating band	Servers	Share
Excellent (90–100)	61	26%
Good (70–89)	120	52%
Fair (50–69)	49	21%
Needs improvement (0–49)	1	<1%
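The band boundaries in the table translate into a simple lookup. This sketch uses the thresholds and counts shown above; the function name is hypothetical.

```python
def rating_band(score):
    """Map a 0-100 MCP score to the rating bands used in the table."""
    if score >= 90:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Fair"
    return "Needs improvement"


# Server counts per band, taken from the table.
band_counts = {"Excellent": 61, "Good": 120, "Fair": 49, "Needs improvement": 1}
introspected = sum(band_counts.values())
band_shares = {
    band: round(100 * count / introspected, 1) for band, count in band_counts.items()
}
```

A score of 81, the mean and median reported above, lands in "Good". The single "Needs improvement" server is about 0.4% of the 231, which is why the rounded share in the table looks like zero.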

Where do servers do well - and fall short?

Each bar is the share of introspected servers that pass a given quality check, among the servers it applies to. Tool and parameter descriptions are what an agent reasons over; output schemas and annotations are newer best-practice signals where adoption is still catching up. Error-output modeling (M13) is informational - it doesn’t affect the score - and surfaces a deeper gap: of the tools that do declare a schema, how many describe a failure path at all, so an agent can tell a failed call from an empty-but-successful one.

Check	Quality check	Servers	Pass rate
M2	Complete server metadata	231	94%
M3	Every tool described	231	99%
M4	Every parameter described	231	65%
M5	Tools declare a specific output schema	231	22%
M6	Tools carry annotations	231	39%
M8	Resources well-formed	61	84%
M9	Prompts described	60	100%
M12	MCP Apps (ui://) served as HTML	9	78%
M13	Schema-bearing tools that model an error path	63	5%
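The "Servers" column varies because each pass rate is taken only over the servers a check applies to. A minimal sketch of that calculation, assuming each server's results are stored as a mapping from check ID to `True`, `False`, or `None` for "not applicable" (this storage format is an assumption for the example):

```python
def pass_rate(results, check_id):
    """Return (applicable_count, pass_rate) for one check.

    Servers where the check does not apply (value None or missing) are
    left out of the denominator rather than counted as failures.
    """
    applicable = [r[check_id] for r in results if r.get(check_id) is not None]
    if not applicable:
        return 0, None
    return len(applicable), sum(applicable) / len(applicable)


# Made-up results for three servers; only two of them expose resources.
results = [
    {"M3": True, "M8": True},
    {"M3": True, "M8": None},
    {"M3": False, "M8": False},
]
```

For this toy data `pass_rate(results, "M8")` gives `(2, 0.5)`: the server without resources is not counted against the check, in the same way that M8 in the table is measured over 61 servers rather than all 231.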

How was this measured?

Every figure is computed from 273 of the most-used MCP servers on the Smithery registry, scanned between 2026-06-18 and 2026-07-12. We keep the most recent scan per endpoint, exclude servers operated by Agent Ready, and name no individual server - the report is aggregate and anonymous by design. Score, tool-count, and quality stats are computed only over the 231 servers we could introspect; auth-gated and unreachable servers count only toward connectability.

This is the most popular slice of servers a registry hosts with a remote endpoint, so it runs ahead of the long tail of local-only servers. Read it as a trend signal, not a census. The M1-M13 checks and scoring rules live on the methodology page.

How this report sharpened

The first cut of this report went out on r/mcp, and MCP builders pushed back in ways that made the scanner - and these numbers - more honest. The sharpest framing came straight from that thread: a good output schema is one that’s predictable before it runs - specific enough that the model (and a human reviewer) can tell what a tool returns without calling it. Three changes shipped from the discussion:

“Has a schema” became “has a useful schema.” The output-schema check originally passed any object schema, so a bare {"type":"object"} counted - which tells an agent nothing about what comes back. We tightened it to require named fields (or a $ref), and the figure dropped from 29% to the 22% shown above: the honest number.
Annotations now outweigh output schemas. Missing readOnlyHint/destructiveHint isn’t just a convenience gap - without it a client can’t decide whether to auto-approve or gate a call. On that safety argument we reweighted annotations above output schemas in the score.
A new signal: do tools model failure? Even tools that ship a schema usually describe only the success case; the error path comes back as unstructured text, so an agent can’t tell “failed, back off” from “succeeded, empty.” We added an informational check for it - only 5% of schema-bearing tools admit a failure path.
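To make the safety argument for annotations concrete, here is a sketch of a hypothetical client-side approval policy. The policy and its three labels are illustrative assumptions, not part of the scanner or the protocol; only `readOnlyHint` and `destructiveHint` are real annotation names.

```python
def approval_policy(tool):
    """Decide how a client might gate a tool call from its annotations.

    Hypothetical policy: only an explicit readOnlyHint earns auto-approval;
    a tool with no annotations is treated as potentially destructive.
    """
    annotations = tool.get("annotations") or {}
    if annotations.get("readOnlyHint") is True:
        return "auto-approve"
    if annotations.get("destructiveHint") is False:
        return "confirm"
    return "confirm-destructive"


read_only_tool = {"name": "list_items", "annotations": {"readOnlyHint": True}}
unannotated_tool = {"name": "delete_item"}
```

`read_only_tool` is auto-approved, while `unannotated_tool` falls into the most cautious bucket because the client has nothing to go on. Annotations are hints supplied by the server, so a real client would also weigh whether it trusts the server before relying on them.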

Two framing points from the same discussion, worth stating plainly: this is a readiness scan, not a security one - a read-only metadata probe doesn’t assess what a server can do to your machine. And “99% write tool descriptions” is double-edged: a well-written description is also exactly where prompt-injection hides, since the model reads it as instructions. Description quality isn’t safety.
