🔄 Авто-синхронизация: из Discussion #1367 каждые 6 часов.

High-Availability Architecture¶

Автор: @a-kuprin · Категория: Proposals · Создано: 2026-06-25 08:16 UTC · Обновлено: 2026-06-28 15:57 UTC

📝 Описание¶

Make every node-side service horizontally scalable and rolling-update safe, with no single point of failure, and deliver a Kubernetes-ready deployment packaged as a Helm chart.

This proposal builds on the current architecture (../high-availability-architecture.md) and the binary-rollout design (../rolling-update.md). It does not change the inference-chain.

1. Goals¶

No single point of failure. Every node-side service runs ≥2 instances, ideally across machines.
Rolling updates with zero dropped in-flight work (see ../rolling-update.md).
Scale the read/serve path (chain queries, inference fan-out, PoC callbacks) horizontally.
Keep exactly-once chain effects. Even with many instances, each chain transaction and each block-driven action happens once.
Kubernetes-ready deployment via Helm. The HA decomposition above is the prerequisite for packaging the node stack as a Helm chart: one chart (or subcharts) per service, externalized shared dependencies (Postgres, NATS, Redis), and native K8s primitives for scaling, health, drain, and rolling updates (see §9).

What already meets the bar¶

Service	HA today
`proxy`	Immutable; run N behind a VIP/L4 LB
`edge-api`	Stateless; N instances + `edge-api-router` (round-robin)
`versiond` + `devshardd`	N instances on separate machines + `versiond-router` (sticky hash) on shared Postgres

What blocks HA today¶

decentralized-api (dapi) is a single monolithic process: its chain event listener / phase engine has no leader election, it embeds a per-process NATS, holds a local keyring for signing, and queries the chain directly. Two dapi instances would duplicate transactions and ML commands (see ../high-availability-architecture.md §5). This proposal restructures dapi so it can be made highly available.

2. Target architecture (overview)¶

                          ┌──────────────────────────────┐
            clients ────▶ │   proxy (N, behind L4 LB)     │
                          └───────────────┬───────────────┘
        ┌──────────────────────┬──────────┴───────────┬───────────────────────┐
        ▼                      ▼                      ▼                        ▼
 ┌─────────────┐      ┌────────────────┐     ┌────────────────┐      ┌────────────────┐
 │  edge-api    │      │ dapi: edge-srv │     │ dapi: node-mgr │      │ versiond[-router]│
 │  (N, HA)     │      │ (N, PoC/admin) │     │ (broker/PoC)   │      │  → devshardd    │
 │  HA chain    │      │  REST callbacks│     │  + leader      │      │  (shared PG)    │
 │  proxy+cache │      └───────┬────────┘     └───────┬────────┘      └────────────────┘
 │  + event hub │              │                      │
 └──────┬───────┘              │ publish chain msgs   │ publish chain msgs
        │ events (pub/sub)     ▼                      ▼
        │              ┌──────────────────────────────────────┐
        ├─────────────▶│        NATS (standalone, HA)          │  ◀── one queue for
        │              │  subjects: chain.tx, events.*, ...    │      all chain-bound msgs
        │              └───────────────────┬──────────────────┘
        │                                  ▼  (single consumer / msg)
        │                        ┌──────────────────────┐
        │                        │   signer service     │  signs with warm key,
        │                        │   (N, queue group)   │  sends tx to chain
        │                        └──────────┬───────────┘
        │                                   ▼
        ▼                              inference-chain
   Redis (shared edge-api state:        (gRPC + RPC)
   leader lock, event cursor, cache)

Two pillars:

A. edge-api becomes the highly-available chain access + event hub. A separate edge-api tier gathers block events from the chain, publishes them to NATS (and other subscribers), and caches chain queries so node services do not each open their own gRPC/RPC subscriptions. The same surface can later be reused by dashboards and monitoring (read APIs + event stream) without coupling observability to dapi or devshardd. Redis backs edge-api's shared state and leader election.
B. dapi is decomposed into independently-scalable services around a standalone HA NATS queue, a stateless signer service, Postgres as the only stateful backend, and stateless REST/echo workers.

3. Pillar A — edge-api as the HA event hub & chain cache¶

Purpose of a separate edge-api¶

edge-api is split out as its own service so the node has one chain-facing tier with two jobs:

Block events — subscribe to the inference-chain once (CometBFT NewBlock + per-tx events), normalize them, and transmit the stream to NATS and other subscribers (dapi node-manager, devshardd, PoC workers).
Query cache — serve and cache chain read APIs (participants, epochs, params, escrows, etc.) so every consumer does not dial gRPC/RPC independently.

That separation keeps dapi and devshardd focused on their domain logic while edge-api owns how the node talks to the chain. The same HTTP/gRPC read surface and event fan-out can be reused later by dashboard and monitoring systems (status pages, ops tooling, external observers) without embedding chain clients in each product binary.

Today edge-api is a stateless read-only proxy for 22 Tier A routes. We extend it to be the chain-access layer for the whole node.

3.1 Move the event listener into edge-api¶

Relocate the chain event listener that lives in dapi (decentralized-api/internal/event_listener/) into edge-api.
edge-api subscribes to the chain (CometBFT WebSocket NewBlock + RPC BlockResults per-tx events) and re-publishes normalized events to NATS and other consumers (dapi services, devshardd) via pub/sub.
Consumers (dapi node-manager, PoC services, devshardd) subscribe to edge-api events instead of opening their own chain subscriptions. This removes N independent chain subscriptions and centralizes block processing.

3.2 Leader election (only one instance triggers events)¶

edge-api scales to N instances, but block-driven side effects must fire once. So:

Every instance stays in sync (each can serve queries and hold a warm event cursor), but only the elected leader advances the canonical block cursor and emits the authoritative event stream.
Redis holds the leader lock (e.g. SET NX PX lease with renewal) and the shared event cursor (last_processed_height) so a new leader resumes exactly where the old one stopped — no gaps, no replays.
Emitted events are propagated to all instances (and downstream services) via pub/sub, so followers and consumers see the same stream the leader produced. On leader loss, another instance takes the lock within the lease TTL and continues from the Redis cursor.

3.3 Redis as edge-api shared state¶

Redis use	Why
Leader lock (lease + renew)	Single active event emitter
Event cursor (`last_processed_height`)	Gap-free failover
Chain query cache (optional)	Reduce duplicate chain gRPC load; TTL per route
Fan-out / pub-sub of events	Propagate the leader's event stream to all instances and subscribers

edge-api thus becomes a highly-available proxy + cache for the inference-chain, and the single source of chain events for the node.

3.4 Consumers stop querying the chain directly¶

dapi no longer queries the inference-chain directly. It uses HA edge-api for chain reads and subscribes to edge-api for events. This shrinks dapi to its unique responsibilities (below).
devshardd likewise subscribes to edge-api events (escrow created/settled, new block/phase) rather than maintaining its own chain WebSocket, reducing per-child chain connections. (devshardd keeps its own gRPC tx path for disputes, or routes them through the signer queue — see §5.)

4. Pillar B — decentralized-api becomes single-purpose¶

After Pillar A, dapi sheds chain-query and event-subscription duties. Its remaining unique responsibilities are:

Node manager (broker: ML node lifecycle per epoch phase).
Admin panel (admin REST: node CRUD, model registration, setup report, etc.).
PoC / cPoC handler + scraper (artifact ingest, commit worker, off-chain validation, proof serving).

These become independently deployable, mostly-stateless services. The only mutable backends are Postgres (shared) and the NATS queue.

4.1 Service decomposition¶

New service	Responsibility	Scaling	State
edge-srv (REST workers)	Echo HTTP for PoC callbacks (`/v2/poc-batches/...`) and admin REST events	Multi-instance (immutable)	none (writes to Postgres / publishes to NATS)
node-manager	Broker reconciliation + phase engine reactions (PoC stage commands, validation sampling)	Leader-elected (single active driver)	Redis lock + Postgres
signer	Sign chain messages with the warm key and broadcast	Multi-instance, NATS queue group (one message consumed once)	warm keyring only
PoC services	Artifact store, commit worker, off-chain validation, proof client	Mixed (callbacks scale; commit driven by node-manager leader)	Postgres

4.2 Standalone HA NATS queue¶

Replace the embedded per-process NATS (decentralized-api/internal/nats/server/server.go) with a standalone, clustered NATS (JetStream) shared by all instances.

Every chain-bound message is published to NATS (subject e.g. chain.tx), carrying the message and metadata. Producers are any service that needs to write to chain (node-manager, PoC commit, edge-srv, devshardd disputes).
The queue is the single, durable, ordered-enough path to the chain. It survives instance restarts and decouples producers from the signer.

4.3 Signer service (warm-key signing, exactly-once consume)¶

The signer is a NATS queue-group consumer of chain.tx: with a queue group, each message is delivered to exactly one signer instance, so N signers share the load but never double-sign the same message.
The signer holds the warm key (Cosmos keyring), wraps in authz.MsgExec with feegrant from the cold account where applicable (current model in cosmosclient/tx_manager), signs, and broadcasts to the chain.
Broadcast/observe/retry state moves to the durable NATS streams (txs_to_send / txs_to_observe equivalents) so any signer instance can pick up retries. Idempotency keys (e.g. inference id + msg type) guard against duplicate submission across retries/failover.
Because signing is isolated behind the queue, the warm key lives only in the signer — other services never need the keyring.

4.4 Postgres as the only HA data backend¶

All mutable state that must survive instance loss lives in Postgres (payloads, stats, PoC artifacts/commits metadata, config/cursors as needed). Per-process SQLite KV (e.g. dapi apiconfig last_processed_height) moves to Postgres/Redis so any instance is interchangeable.
This mirrors the devshardd rule: multi-instance ⇒ Postgres (../high-availability-architecture.md §4).

4.5 Stateless echo workers (PoC callbacks + admin)¶

The Echo HTTP layer for PoC callbacks and admin REST is immutable: it only reads/writes Postgres or publishes to NATS. Therefore it can run N instances behind the proxy with no coordination.
The only operations needing single-execution semantics (phase-driven stage commands) belong to the leader-elected node-manager, not the echo workers.

5. Rolling updates¶

Rolling updates apply per service and reuse the design in ../rolling-update.md. The HA stack depends on the same drain semantics at two layers: binary swap inside a live supervisor, and whole-host evacuation behind the sticky router.

Rolling-update concepts (summary)¶

The rolling-update plan defines how we roll out new binaries without dropping in-flight work. Three operator guarantees:

Requests already accepted by an old instance may finish — we do not kill while work is still running.
A new instance must be ready before it receives traffic.
After the new instance is reachable, new requests go to it; the old instance drains until idle, then exits.

Blue/green + drain inside versiond (Part 1 §1.1). When governance publishes a same version name, new sha256 binary, versiond downloads the new devshardd, starts it on a new port while the old child keeps serving, waits for GET /ready (not just TCP accept), atomically swaps the in-process route table so new requests hit the new child, marks the old child draining (out of the route table but still alive), polls in-flight count until zero (or a drain timeout), then SIGTERM with a long shutdown grace. Old and new can overlap only when durable state lives in shared Postgres — SQLite is single-writer and cannot support concurrent children (Part 1 §1.2).

Two drain layers — do not conflate (Part 1 §1.7–§1.8).

Event	Layer	Router involved?
Same name, new sha256 (governance binary update)	versiond blue/green + devshardd child drain	No — `versiond-router` upstream unchanged
versiond host removal, replace, or supervisor upgrade	`versiond-router` host evacuation	Yes — mark upstream `down`, drain pinned escrows, then stop the host

During a devshardd binary swap, sticky routing is unchanged: the router still points at versiond-N:8080; only the child port inside versiond swaps. Router drain is for when the versiond process itself must leave the pool (scale-down, host maintenance, versiond binary upgrade).

Signals the plan adds to devshardd: /healthz (liveness), /ready (readiness gate for route swap), /drain/status (in-flight work), and configurable DEVSHARD_SHUTDOWN_GRACE so long SSE streams are not cut at 5s.

Kubernetes mapping (Part 2). The same guarantees map to RollingUpdate (maxUnavailable: 0, maxSurge: 1), readinessProbe → /ready, preStop (drop from endpoints before SIGTERM), and terminationGracePeriodSeconds aligned with shutdown grace. Pod/host evacuation maps to Part 1 §1.8 (router drain), not the in-versiond binary swap.

How rolling updates apply in this HA proposal¶

Stateless services (edge-api, edge-srv echo workers, signer): standard rolling update — bring a new instance up, health-check, route to it, drain the old. Behind their routers / queue groups this is transparent.
Leader-elected services (edge-api emitter, node-manager): a rolling update may trigger a leader handoff; the Redis lease + cursor make this safe (new leader resumes from the cursor).
versiond / devshardd (same version, new binary): blue/green + drain inside versiond, with the shared Postgres making old+new overlap correct — see ../rolling-update.md §1.
versiond host replace, scale-down, or maintenance: drain at versiond-router (mark upstream down, wait for pinned escrows idle, then stop the host) — see ../rolling-update.md §1.8.
NATS / Redis / Postgres: run in their own HA/cluster modes; updated with their native rolling procedures, independent of app rollouts.

6. Kubernetes & Helm (deployment target)¶

Docker Compose overlays (local-test-net/, deploy/join/) prove multi-instance topology today; production HA should land on Kubernetes with a Helm chart that encodes the same service boundaries as this proposal.

A Helm chart alone does not make a monolith HA.

Intended chart shape (high level)¶

Chart / workload	K8s notes
`edge-api`	`Deployment` + `Service`; `readinessProbe` → `/healthz`; HPA-friendly
`edge-api` event hub	Same image/chart; leader via Redis; subscribers use cluster DNS or NATS
`versiond`	`Deployment` + sticky `Service` or Ingress consistent-hash on escrow id
`devshardd`	Child of versiond in-process today; chart may deploy versiond only
`signer`	`Deployment`; NATS queue-group consumer; one message consumed once
`dapi` services	Split Deployments: echo workers (scale out), node-manager (leader)
`proxy`	Optional Ingress / Gateway in front of edge-api, dapi, versiond paths
Dependencies	Postgres, NATS, Redis as subcharts or external endpoints in `values.yaml`

Helm deliverable¶

Single umbrella chart (or app-of-apps) for a Gonka node: enable/disable HA overlays (multi edge-api, multi versiond, external NATS/Redis) via values.
Documented values for PGHOST, NATS URL, Redis URL, chain gRPC/RPC URLs, replica counts, resource limits, and graceful shutdown timeouts aligned with inference/SSE duration.
CI: helm template / helm lint on chart changes; optional kind smoke.

Compose remains the developer / integration-test path; Helm is the target for production Kubernetes once Pillars A–B and phasing steps 1–5 are in place.

7. Notes¶

Redis vs NATS JetStream KV for the cursor/lock — avoid adding Redis if JetStream KV suffices. (Proposal assumes Redis per the stated direction.)
Cache invalidation for edge-api chain cache around epoch/phase boundaries. This should be reused

💬 Комментарии (1)¶

Комментарий 1 — @gmorgachev ¶

2026-06-27 18:12 UTC

Hi @a-kuprin !

Couple questions:

Target architecture (overview)

Which serveses are expected to be publicly available?Based on graph i expect all 4:

edge-api

dapi: edge-srv

dapi: node-mgr

versiond

Do we really need all of them to be not private?

Pillar B — decentralized-api becomes single-purpose

How do you think to scale service which handles PoC callback? It uses high performant local storage for PoC artifacts. This storage is merkle-tree like and requires high performance for r/w

Of we want single instance to be responsible for the whole single PoC cycle?

Rolling updates

Definitely agree with part 1, let's do it I think we can postpone part 2

Kubernetes & Helm (deployment target)

I support the idea to allow kubernetes deployment. But i'd keep simplest single-instance-per-each-service as base deploy approach in deploy/join. To not overcomplicate onboarding for small miners

Overall, that's a solid long term goal. I'd try to split it in smaller steps for implementation and phases of deploy. And keep them compartible with deployed version: - rolling updated (i think highest) - separate event listered from node-manager / etc - ....

↳ Ответ от @a-kuprin · 2026-06-28 15:57 UTC

Which serveses are expected to be publicly available?

I think it's enough to have edge-api (that actually takes public endpoints from decentralized-api) and versiond dapi should be private as it is admin tool and node-mgr that is used internally by devshardd spawned by versiond

2. How do you think to scale service which handles PoC callback?

I think this should be analyzed deeper, I didn't took into account this merkle-tree like storage use

Kubernetes & Helm (deployment target)

I assume it to be additional feature, not the replacement of docker compose deployment. When high-availability refactoring is ready it is quite easy using agent rewrite compose to helm. And I think even small miners can benefit from kubernetes, as full installation and update (if DevOps engeneer is familiar with kube and tooling like ArgoCD) could be even easier and smoother