🔄 Auto-sync: from Discussion #869 every 6 hours.
GiP #860 - Inference Quality Protocol: Semantic Inference Optimization
Author: @Mayveskii · Category: Proposals · Created: 2026-03-06 13:54 UTC · Updated: 2026-03-08 13:32 UTC
📝 Description
Background
This discussion is the design document that precedes implementation. PR [#859] (semantic cache) is the first implementation milestone; it exists to test the infrastructure hypothesis, not to define the full system. The full system is defined here.
Per the review process @akup outlined on [#856] and [#802]: design first, then code. This GiP is that design step. PR [#859] is scoped strictly to what this document justifies in Phase 0.
PR [#859] introduces CacheQualityWeight — a reward for cache reuse. It is a working
implementation, but deliberately scoped: it solves one part of a larger problem.
This discussion proposes what that larger problem is, how it connects to everything already in the protocol, and what the full solution looks like.
The gap: quality has no protocol representation
The Gonka network has a rigorous economic model for compute: Proof-of-Compute measures nonce generation, validates it across nodes, and converts it to epoch weight. Every node understands, optimizes for, and is incentivized by PoC.
Quality of inference — whether a response was useful, accurate, timely, or appropriate for the request type — has no equivalent protocol representation. It is invisible to the chain.
This is not a criticism. It is a natural stage of development. PoC needed to land first (see [#856], [#821]). But as the network grows, the absence of a quality signal creates predictable failure modes:
- Goodhart's Law: any single metric becomes a target and ceases to measure what it was supposed to. CacheQualityWeight based solely on reuseCount would be gamed by routing, not by quality improvement.
- Routing blindness: GetRandomExecutor distributes traffic uniformly regardless of which node is better at which task. A node specializing in code generation gets the same traffic as one optimized for translation.
- No feedback path: participants sending inference requests have no way to signal whether the result was useful. Their experience does not improve the protocol.
- Developer friction: developers integrating Gonka have no protocol-native guidance on how to structure requests for best results, which model to use for which task, or how to measure their own inference quality over time.
Measured from live network data (epochs 161–191, 2,503,595 inferences):
Composite QualityScore = 0.7236 (6 axes measured, 4 projected)
Key bottlenecks:
L8 Latency consistency: score 0.32 (CV = 0.68, σ = 876ms — high variance)
L0 Compute stability: score 0.65 (CV = 0.35 — weight dropped 60% peak-to-trough)
L6 Reuse (shared): score 0.00 (M=571 → hit_rate ≈ 0)
L4 Usefulness: not measured — no mechanism exists
The quality gap is measurable. The improvement path is quantifiable.
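The two measured scores above are consistent with a simple mapping, score = 1 − CV (0.68 → 0.32 for L8, 0.35 → 0.65 for L0). The proposal does not state this mapping, so the sketch below treats it as an assumption; the function names are illustrative only.

```python
from statistics import mean, pstdev


def coefficient_of_variation(samples):
    """Return sigma / mu for a non-empty sequence of measurements with positive mean."""
    mu = mean(samples)
    if mu <= 0:
        raise ValueError("mean must be positive")
    return pstdev(samples) / mu


def consistency_score(cv):
    """ASSUMED mapping (not stated in the proposal): score = 1 - CV, clamped to [0, 1]."""
    return max(0.0, min(1.0, 1.0 - cv))


# Figures quoted in the text: latency mean 1280 ms, sigma 876 ms.
latency_cv = 876 / 1280
latency_score = consistency_score(latency_cv)
```

Under this assumption a node with perfectly stable latency scores 1.0, and any node whose standard deviation reaches its mean scores 0.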
What this GiP proposes
A governance-controlled, multi-dimensional quality measurement and routing framework, built incrementally on the infrastructure PR [#859] provides.
It has two interlocking components:
1. Quality Axis Registry (measurement)
Ten axes, each independently activatable via governance weights:
| Axis | Measures | Source | Status |
|---|---|---|---|
| L0 | Compute stability (PoC weight CV) | Chain | Exists (see [#856]) |
| L1 | Availability (heartbeat, churn) | Chain | Exists |
| L2 | Correctness (RTV validated/missed rate) | Chain | Exists |
| L3 | Relevance (embed(prompt)↔embed(response) cosine) | DAPI auto | PR [#859] infra |
| L4 | Usefulness (participant feedback) | X-Inference-Feedback header | Proposed |
| L5 | Outcome (developer webhook) | Developer callback | Proposed |
| L6 | Reuse (cache hit rate) | PR [#859] | Landed |
| L7 | Stream fidelity (SSE completeness) | DAPI auto | Proposed |
| L8 | Latency consistency (σ/μ per epoch) | DAPI auto | Proposed |
| L9 | Completion rate (MsgFinish/MsgStart) | Chain | Observable now |
Composite score:
The registry is additive. Nothing breaks if a weight is zero. Axes activate when governance decides the measurement is trustworthy enough to affect rewards.
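The composite formula itself did not survive in this copy of the discussion. A form consistent with the description above (additive, zero weights harmless) and with the axis_weights parameter defined later (basis points summing to 10000) would be the following; the symbols w_i and s_i are introduced here for illustration and are an assumption, not the author's notation.

```latex
\[
\text{QualityScore} \;=\; \frac{1}{10000} \sum_{i=0}^{9} w_i \, s_i,
\qquad \sum_{i=0}^{9} w_i = 10000,
\]
```

Here w_i stands for axis_weights[i] in basis points and s_i for the score of axis Li in [0, 1]. An axis with w_i = 0 contributes nothing, which matches the statement that inactive axes do not affect rewards.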
2. Semantic Inference Optimization (routing + developer experience)
As the protocol accumulates completed inferences, it builds a semantic map of execution patterns: which task types succeed on which nodes, which models handle which request archetypes best, what latency and completion rate look like per specialization.
This map enables two things:
Protocol side (DAPI):
- GetQualityWeightedExecutor replaces GetRandomExecutor
- Traffic flows proportional to QualityScore, not uniformly
- Nodes specializing in a task type attract more of that traffic → higher hit rate →
higher CacheQualityWeight → more rewards → deeper specialization
- The loop is economic, not administrative
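A minimal sketch of what "traffic proportional to QualityScore" could look like. It is not the DAPI implementation: the Executor shape and the function name are hypothetical, and the filter by model follows the author's clarification in the comments that weighting applies only among nodes serving the same model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Executor:  # hypothetical shape, for illustration only
    node_id: str
    model: str
    quality_score: float  # composite score in [0, 1]


def pick_quality_weighted_executor(executors, model, rng=random):
    """Pick one executor serving `model`, with probability proportional to its score."""
    candidates = [e for e in executors if e.model == model]
    if not candidates:
        raise LookupError(f"no executor serves model {model!r}")
    weights = [max(e.quality_score, 0.0) for e in candidates]
    if sum(weights) == 0:
        # No quality signal yet: fall back to a uniform choice, as random routing does today.
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]
```

One design question this leaves open is starvation: a node with a very low score receives almost no traffic and so has little chance to improve its measurements. A minimum weight floor would be one way to address that.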
Developer / participant side:
- GET /v1/models/profiles — exposes node specialization centroids and quality scores
- Response headers: X-Suggested-Model, X-Task-Archetype, X-Quality-Score
- Developers learn which model to use for their workload from protocol data, not from
trial and error
- The protocol becomes a knowledge hub, not just a compute dispatcher
This is not prompt modification. The protocol does not change what users send. It provides metadata: "for this type of request, here is what the network knows works." Developers and clients act on that information voluntarily.
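The task-archetype metadata could be derived by nearest-centroid classification over embeddings, the same cosine-scan primitive the proposal says already exists. The sketch below is an illustration under that assumption; the function names and the 0.5 threshold are hypothetical.

```python
from math import sqrt
from typing import Mapping, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_archetype(
    embedding: Sequence[float],
    centroids: Mapping[str, Sequence[float]],
    min_similarity: float = 0.5,  # hypothetical threshold, not from the proposal
) -> Optional[str]:
    """Return the archetype with the most similar centroid, or None if none is close enough."""
    best_name, best_sim = None, min_similarity
    for name, centroid in centroids.items():
        sim = cosine_similarity(embedding, centroid)
        if sim >= best_sim:
            best_name, best_sim = name, sim
    return best_name
```

Returning None when nothing clears the threshold keeps the header advisory: a request that matches no known archetype simply gets no suggestion.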
Why this is not on the edge of feasibility
The technical primitives are proven, deployed, and in production across the industry:
| Component | Precedent | Status in Gonka after PR [#859] |
|---|---|---|
| Task classification by embedding | Semantic Router, HuggingFace zero-shot | MLNodeEmbedder + cosine scan — exists |
| Model routing by quality history | OpenRouter, LiteLLM Router | GetRandomExecutor → replaceable |
| Per-request quality tracking | Every observability stack | StatsStorage.InferenceRecord — exists |
| Accumulated vector knowledge base | RAG (standard pattern) | InMemoryCacheStore — exists |
| SDK with routing best practices | Vercel AI SDK, LangChain | Does not exist for Gonka — proposed |
The infrastructure from PR [#859] is sufficient for phases 0–4. Phases 5–7 require additional endpoints and a client library (discussed below).
Measured evidence
All numbers are reproducible from public endpoints. No private data used.
Network baseline (gonka.gg/api/public, epochs 161–191):
Inferences: 2,503,595 total (avg 75,016/epoch)
Participants: 109–197/epoch
Miss rate: 3.25% (binomial test: k=81,360 << critical 251,140, α=0.05 → PASS)
Completion: mean 90.4%, range 72–99%, σ=7.4%
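The quoted critical count can be reproduced with a one-sided binomial test under the normal approximation. The null miss rate is not stated in the text; the value p0 = 0.10 below is an inference, chosen because it yields the quoted 251,140, and should be read as an assumption.

```python
from math import sqrt
from statistics import NormalDist


def binomial_upper_critical(n, p0, alpha):
    """One-sided critical count under H0 (miss rate = p0), normal approximation."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    return n * p0 + z * sqrt(n * p0 * (1.0 - p0))


n = 2_503_595  # inferences in epochs 161-191 (from the text)
k = 81_360     # observed misses (from the text)
p0 = 0.10      # ASSUMPTION: inferred, not stated in the proposal
critical = binomial_upper_critical(n, p0, alpha=0.05)
miss_rate = k / n
```

With these inputs the observed count sits far below the critical value, which is what the PASS in the text expresses.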
Live inference (proxy.gonka.gg, Qwen3-235B, 16 requests):
Non-stream latency: mean=1280ms, σ=876ms, CV=0.68 ← primary quality bottleneck
Stream fidelity: 8/8 SSE [DONE] received (100%)
ms/output token: mean 154ms
Specialization multiplier:
M=571 (Qwen3-32B, shared): hit_rate = 0.000473
M=12 (QwQ-32B, low-M): hit_rate = 0.0225 → 47.6× improvement
M=1 (unique model): hit_rate = 0.27 → 571× improvement
The economic case for specialization is mathematical, not speculative.
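The three hit rates above fit a single relation, hit_rate = 0.27 / M. The sketch below assumes M is the number of nodes sharing a model and that requests are spread uniformly across them; neither is spelled out in the text, so treat both as a reading of the numbers rather than the author's stated model.

```python
def shared_hit_rate(single_node_hit_rate, m):
    """Expected hit rate when a model's traffic is spread uniformly over m nodes."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return single_node_hit_rate / m


def specialization_multiplier(m_from, m_to):
    """Factor by which the hit rate improves when going from m_from to m_to nodes."""
    return m_from / m_to


base = 0.27  # hit rate at M=1, from the text
shared = shared_hit_rate(base, 571)
low_m = shared_hit_rate(base, 12)
gain_low_m = specialization_multiplier(571, 12)
gain_unique = specialization_multiplier(571, 1)
```

Rounded to the precision used in the text, these give 0.000473, 0.0225, 47.6 and 571.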
Routing simulation:
| Metric | Current (random) | Proposed (quality-weighted) |
|---|---|---|
| Traffic distribution | Uniform (1/M) | Proportional to QualityScore |
| Completion rate σ | 7.4% | ~4.4% (projected ↓40%) |
| Mean latency | 1280ms | ~1088ms (projected ↓15%) |
| GPU saves/epoch at 20% specialized | 0 | 940,698 |
Hypotheses (all PROVEN from measured data):
- Multi-axis quality measurement is feasible → 6/10 axes measured from live network
- Specialization improves quality → 47.6× multiplier proven mathematically from topology
- Protocol lacks a quality feedback loop → L4/L5 have zero protocol mechanism today
- Quality-weighted routing improves network economics → proven from routing simulation
Implementation roadmap
| Phase | Scope | Depends on | Status |
|---|---|---|---|
| 0 | L6 semantic cache | [#793] → [#703] → [#859] | Code complete |
| 1 | Proto: extend CacheQualityEpochSummary (fields 8–13: L4/L7/L8/L9 axes) | Phase 0 merged | Defined |
| 2 | L7+L8 tracking in QualityReporter | Phase 0 | Planned |
| 3 | L4: X-Inference-Feedback header parser in DAPI | Phase 0 | Planned |
| 4 | GetQualityWeightedExecutor routing | Phase 2+3 | Planned |
| 5 | Semantic knowledge base (task archetype centroids) | Phase 0 + StatsStorage | Planned |
| 6 | /v1/models/profiles + enrichment headers | Phase 4+5 | Planned |
| 7 | Developer SDK (gonka-sdk, Python + TypeScript) | Phase 6 | Gonka Labs |
Phase 7 is a developer-facing product, not a protocol proposal. It belongs in a separate repository under the Gonka Labs umbrella. The protocol (Phases 0–6) provides the data and the endpoints; the SDK makes them ergonomic. Keeping them separate means:
- The protocol can evolve at protocol pace (governance, security, consensus)
- The SDK can ship at developer pace (weekly releases, breaking changes allowed)
- Third-party SDKs (LangChain plugin, LiteLLM router backend, MCP server) can build on the same Phase 6 endpoints independently
Developer tooling strategy (Phase 7 scope)
The gap today: developers integrating Gonka do not have a standard pattern. They write raw HTTP calls, pick models manually, have no signal on inference quality, and get no guidance from the protocol on how to improve their workloads.
The SDK fills that gap using infrastructure the protocol will have after Phase 6.
What the SDK wraps
Protocol endpoints (Phase 6):
  POST /v1/chat/completions    OpenAI-compatible (existing, proxy.gonka.gg)
  GET  /v1/models/profiles     Quality scores + specialization centroids (Phase 6)
  POST /v1/chat/completions    X-Inference-Feedback: +1/-1 header (Phase 3)
Response headers (Phase 6):
  X-Quality-Score: 0.82              Node quality score for this request
  X-Suggested-Model: Qwen/QwQ-32B    Better model for this task type
  X-Task-Archetype: code-review      Detected task category
  X-Cache: HIT / MISS                Cache result (Phase 0)
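On the client side, consuming these headers amounts to a small parsing step. The sketch below uses only the header names listed above, which are themselves proposed rather than deployed; the QualityHeaders type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class QualityHeaders:  # hypothetical container, for illustration only
    score: Optional[float]
    suggested_model: Optional[str]
    task_archetype: Optional[str]
    cache_hit: Optional[bool]


def parse_quality_headers(headers: Mapping[str, str]) -> QualityHeaders:
    """Extract the proposed quality headers; missing or malformed values become None."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    raw_score = lowered.get("x-quality-score")
    try:
        score = float(raw_score) if raw_score is not None else None
    except ValueError:
        score = None  # tolerate a malformed value rather than failing the request
    raw_cache = lowered.get("x-cache")
    cache_hit = None if raw_cache is None else raw_cache.upper() == "HIT"
    return QualityHeaders(
        score=score,
        suggested_model=lowered.get("x-suggested-model"),
        task_archetype=lowered.get("x-task-archetype"),
        cache_hit=cache_hit,
    )


def feedback_header(useful: bool) -> Dict[str, str]:
    """Build the proposed feedback header: +1 for useful, -1 otherwise."""
    return {"X-Inference-Feedback": "+1" if useful else "-1"}
```

Treating every header as optional lets the same client work against nodes that have not yet reached Phase 6.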
SDK design (TypeScript / Python)
TypeScript (Axios-based, OpenAI-SDK-compatible drop-in):
import { GonkaClient } from "@gonka-labs/sdk";

const client = new GonkaClient({
  apiKey: process.env.GONKA_API_KEY,
  baseURL: "https://proxy.gonka.gg/v1",
  qualityFeedback: true, // auto-send X-Inference-Feedback based on response
  autoRoute: true,       // pick model from /v1/models/profiles for task type
});

const response = await client.chat.completions.create({
  messages: [{ role: "user", content: "review this code: ..." }],
  // no model needed: SDK detects task archetype → routes to QwQ-32B if code task
});

// SDK attaches quality metadata to the response object:
console.log(response.quality.score);          // 0.82
console.log(response.quality.suggestedModel); // "Qwen/QwQ-32B"
console.log(response.quality.cacheHit);       // false
Python (httpx-based, drop-in for openai package):
import os

from gonka import GonkaClient

client = GonkaClient(
    api_key=os.environ["GONKA_API_KEY"],
    auto_route=True,
    quality_feedback=True,
)

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "translate to French: ..."}],
    # SDK routes to specialised translation node via /v1/models/profiles
)
print(response.quality)  # QualityMetadata(score=0.91, cache_hit=True, latency_ms=340)
What this achieves
- Developers get best-practice inference out of the box, without reading protocol docs
- Every SDK request sends X-Inference-Feedback, improving L4 data for all nodes
- Model selection is driven by protocol quality data, not guesswork
- Cache hit rate improves as autoRoute concentrates traffic on specialised nodes (↑ M→1)
- The quality feedback loop closes: SDK → L4 signal → GetQualityWeightedExecutor → better routing → higher QualityScore → SDK reports better outcomes → loop
Relationship to existing open-source patterns
| Pattern | Gonka SDK equivalent |
|---|---|
| LangChain Chat model | GonkaClient with autoRoute |
| Semantic Router (Aurelio AI) | /v1/models/profiles + X-Task-Archetype |
| LiteLLM Router | GetQualityWeightedExecutor (Phase 4) |
| OpenAI SDK | Drop-in, same interface, Gonka-specific headers added |
| Vercel AI SDK adapter | @gonka-labs/vercel-ai-adapter (Phase 7 stretch) |
The Gonka SDK is not a novel architectural invention — it follows established patterns. What makes it Gonka-specific is that the routing and quality signals come from the on-chain quality registry, not a centralized service. That is the differentiator.
Proto extension (Phase 1)
Extend CacheQualityEpochSummary with additional axes:
message CacheQualityEpochSummary {
  // existing fields 1–7 (PR #859)
  uint32 completion_rate_bps = 8;   // L9: MsgFinish / (MsgFinish + MsgMiss + MsgInvalidate)
  uint32 avg_latency_ms = 9;        // L8: mean request latency
  uint32 latency_stddev_ms = 10;    // L8: σ(latency), the consistency signal
  uint32 stream_fidelity_bps = 11;  // L7: SSE done_chunks / total_chunks × 10000
  int64 feedback_score_sum = 12;    // L4: Σ feedback signals (+1/-1)
  int64 feedback_count = 13;        // L4: number of feedback signals this epoch
}
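As an illustration of how fields 8–13 might be filled at an epoch boundary, the sketch below aggregates per-request records into the summary values. The InferenceRecord shape is hypothetical, and stream fidelity is interpreted here as the share of streaming requests that received their terminating [DONE] event, which is an assumption about the field comment.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class InferenceRecord:  # hypothetical record shape, for illustration only
    outcome: str                        # "finish", "miss" or "invalidate"
    latency_ms: int = 0                 # meaningful only when outcome == "finish"
    stream_done: Optional[bool] = None  # None for non-streaming requests
    feedback: int = 0                   # +1, -1, or 0 when no feedback was sent


def summarize_epoch(records: Iterable[InferenceRecord]) -> Dict[str, int]:
    """Aggregate one epoch of records into the values of proto fields 8-13."""
    records = list(records)
    latencies = [r.latency_ms for r in records if r.outcome == "finish"]
    streams = [r for r in records if r.stream_done is not None]
    feedback = [r.feedback for r in records if r.feedback != 0]
    return {
        # Basis points, using integer division so every node computes the same value.
        "completion_rate_bps": len(latencies) * 10_000 // len(records) if records else 0,
        "avg_latency_ms": round(mean(latencies)) if latencies else 0,
        "latency_stddev_ms": round(pstdev(latencies)) if latencies else 0,
        "stream_fidelity_bps": (
            sum(r.stream_done for r in streams) * 10_000 // len(streams) if streams else 0
        ),
        "feedback_score_sum": sum(feedback),
        "feedback_count": len(feedback),
    }
```

Storing the feedback sum and count separately, as the proto does, lets later epochs be combined without losing the sample size behind each average.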
Governance weight parameters (new fields in CacheQualityParams):
// axis_weights[i] is the weight for Li in basis points. Sum must equal 10000.
// Default: [1000,1000,1500,1000,1000,500,1000,1000,1000,1000]
repeated uint32 axis_weights = 8;
// max_cache_entries bounds InMemoryCacheStore growth.
// Default: 50000. At 1.5KB/entry: ~75MB peak. Required for production nodes.
uint64 max_cache_entries = 9;
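A governance update to axis_weights would need a validity check before it takes effect. The sketch below shows one way to enforce the sum-to-10000 rule and to compute a composite score in integer basis points; the function names are hypothetical, and representing each axis score in basis points is an assumption made so the arithmetic stays integer-only.

```python
BPS = 10_000
NUM_AXES = 10
DEFAULT_AXIS_WEIGHTS = [1000, 1000, 1500, 1000, 1000, 500, 1000, 1000, 1000, 1000]


def validate_axis_weights(weights):
    """Raise ValueError unless there are ten non-negative weights summing to 10000."""
    if len(weights) != NUM_AXES:
        raise ValueError(f"expected {NUM_AXES} weights, got {len(weights)}")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if sum(weights) != BPS:
        raise ValueError(f"weights must sum to {BPS}, got {sum(weights)}")


def composite_score_bps(axis_scores_bps, weights=DEFAULT_AXIS_WEIGHTS):
    """Weighted sum of per-axis scores, all expressed in basis points (0..10000)."""
    validate_axis_weights(weights)
    if len(axis_scores_bps) != NUM_AXES:
        raise ValueError(f"expected {NUM_AXES} axis scores")
    if any(not 0 <= s <= BPS for s in axis_scores_bps):
        raise ValueError("axis scores must be within 0..10000")
    return sum(w * s for w, s in zip(weights, axis_scores_bps)) // BPS
```

The default weights listed above do sum to 10000, so they pass this check unchanged.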
Scale constraint (honest)
InMemoryCacheStore currently has no entry limit. At mainnet scale (75K inferences/epoch, 384-dim embeddings, MaxCacheAgeEpochs=10), the store retains roughly 750K entries: peak ~1.15GB RAM and an O(750K) cosine scan per request.
The max_cache_entries governance parameter (Phase 1) bounds this. With N=50,000: peak ~75MB and an O(50K) scan, acceptable on any modern node. The EvictExpired call at each epoch boundary keeps the store bounded over time.
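The memory figures follow from 384 float32 values per entry (1536 bytes): about 75K × 10 epochs ≈ 750K entries gives roughly 1.15GB, and 50,000 entries gives roughly 77MB. The sketch below pairs that arithmetic with a toy bounded store. It is not InMemoryCacheStore: the class, the oldest-first eviction policy and the expiry rule are assumptions for illustration.

```python
from collections import OrderedDict
from math import sqrt

BYTES_PER_EMBEDDING = 384 * 4  # 384 float32 values; response payload is ignored here


def estimated_embedding_bytes(entries):
    """Memory taken by the embeddings alone for a given number of entries."""
    return entries * BYTES_PER_EMBEDDING


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class BoundedCacheStore:
    """Toy store: capped entry count, epoch-based expiry, linear cosine scan."""

    def __init__(self, max_entries, max_age_epochs):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self.max_age_epochs = max_age_epochs
        self._entries = OrderedDict()  # key -> (epoch, embedding, response)

    def __len__(self):
        return len(self._entries)

    def put(self, key, epoch, embedding, response):
        self._entries.pop(key, None)  # re-inserting a key refreshes its position
        self._entries[key] = (epoch, embedding, response)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the oldest insertion first

    def evict_expired(self, current_epoch):
        """Drop entries older than max_age_epochs; return how many were removed."""
        expired = [k for k, (epoch, _, _) in self._entries.items()
                   if current_epoch - epoch > self.max_age_epochs]
        for key in expired:
            del self._entries[key]
        return len(expired)

    def lookup(self, query, threshold):
        """Return the response of the most similar entry at or above threshold, else None."""
        best, best_sim = None, threshold
        for _, embedding, response in self._entries.values():
            sim = cosine_similarity(query, embedding)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best
```

The lookup cost grows linearly with the number of entries, which is why the entry cap bounds scan time as well as memory.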
Related work
- PR [#859]: semantic cache infrastructure (this discussion depends on it)
- PR [#793]: EpochGroupCache, per-block epoch state (merge prerequisite for #859)
- PR [#703]: free inference security fix (merge prerequisite for #859)
- PR [#856]: Continuous PoC complete ([#821]); directly validates the L0 axis. ContinuousPoC is now live infrastructure; quality measurement (L0: compute stability, CV=0.35 measured) sits on top of this foundation. Timing is deliberate: PoC lands first, quality layer follows.
- PR [#812]: StartInference/FinishInference performance (reduces hot-path cost on every inference, including cache HITs)
- PR [#789]: fund atomicity fix. The L2 (correctness) axis tracks invalidation rate; atomicity fixes reduce false invalidations, improving the baseline L2 score.
- GiP [#840]: Prometheus exporter; /admin/v1/cache/stats is Source A in the three-source cross-check triangle proposed there
- GiP [#816]: Node Manager, a k8s deployment standard that maximises cache hit rate organically through model specialization (M=1 per node)
- Discussion [#802]: design-first process; this GiP follows that process explicitly
- Issue [#820]: missed inferences; the L2 (correctness) and L9 (completion rate) axes directly quantify the root cause
- Issue [#839]: log_format=json, 3× latency improvement; prerequisite for honest L8 (latency consistency) baseline measurements
Open questions for the community
- Weight governance: who proposes initial axis_weights? What's the amendment process when a new axis is added?
- L4 feedback incentive: should participants be rewarded (even nominally) for submitting feedback? Without incentive, adoption will be low.
- L5 developer webhook: opt-in or opt-out default? What's the privacy model for outcome data?
- SDK scope: should Phase 7 be a Gonka Labs project or a community-owned repository? What's the governance model for the SDK itself?
- max_cache_entries default: 50,000 is conservative. Is there a preferred bound based on expected node hardware profiles?
- ContinuousPoC integration: should ContinuousPoCEpochSummary.effective_poc_weight be part of L0 axis calculation, or remain a separate PoC track? (@akup, @Mayveskii)
Full design document with scores, routing simulation, and scenario matrix:
docs/specs/inference-quality-protocol.md in the PR [#859] branch.
💬 Comments (1)
Comment 1: @gmorgachev
2026-03-06 20:41 UTC
Quality of inference — whether a response was useful, accurate, timely, or appropriate for the request type — has no equivalent protocol representation. It is invisible to the chain.
The quality of the response (in terms of LLM accuracy) is part of the security model itself: governance defines exactly which models are served, and cross-validation then verifies that. The validation process itself needs some improvement, but the idea is to guarantee identical quality from all participants explicitly, not through feedback.
Routing blindness — GetRandomExecutor distributes traffic uniformly regardless of which node is better at which task. A node specializing in code generation gets the same traffic as one optimized for translation.
Same point: traffic is distributed only among workers that serve exactly the same model. A host can't choose to serve a different one by itself.
The idea of measuring performance in general is a good direction, but I feel that the current proposal doesn't take into account how the chain works now.
↳ Reply from @Mayveskii · 2026-03-08 13:32 UTC
Re: @gmorgachev. Thanks, addressing each point:
1. Quality of response (LLM accuracy) and security model
We’re not replacing governance or cross-validation. They still define which models are allowed and verify identical results. In GiP #860, axes L0–L2 (compute stability, availability, correctness) are exactly what the chain already has (RTV, validation, weight stability). L4 (usefulness) and L5 (outcome) are additional signals on top: “was this result useful?” or “task resolved,” not a substitute for “all participants return the same output for the same model.”
2. Routing and “host can’t choose a different model”
Agreed: the host doesn’t choose the model. In #869, GetQualityWeightedExecutor is intended to work inside the same model: the request is already bound to a model (as today), and the weight only affects which node among those serving that model gets the request. So traffic is still only between workers for the same model; the change is “among those, prefer the one with better L6/L8/L9 (reuse, latency, completion) for that model.”
3. “Proposal doesn’t take into account how the chain works now”
In PR #859, CacheQualityWeight is wired into the existing flow: it’s added to baseCount in the same place as PoC weight (module/chainvalidation.go, settlement). There is no second settlement path, just an extra term in the same formula. Phases 0–6 in #869 build on that: extend proto (fields 8–13), report L7/L8 in QualityReporter, parse X-Inference-Feedback, route by quality among executors for the same model. So we’re explicitly building on top of the current chain logic, not beside it.
What’s already done: We’ve validated the #860 hypothesis with a real setup: gonkalabs/gonka-agent (semantic cache, two participants, different workspaces). R-3 in docs/testing.md shows Participant B getting a partial hit (0.79) from A’s cache: same domain (Go race), different struct. That’s the “one participant structures requests → the other benefits from cache” scenario from this GiP; L6 reuse and time saved are measured. The L6/L8/L9 + X-Inference-Feedback middleware lives in gonkalabs/opengnk. So the proposal is not only on paper; it’s exercised in the agent → proxy → network path, while keeping the current chain and model-routing behavior.