🔄 Auto-sync: from Discussion #816 every 6 hours.

Gonka Node Manager: Automated Node Deployment, Updates, and Monitoring

Author: @ochenUmnayaKatyshka · Category: Proposals · Created: 2026-02-26 11:49 UTC · Updated: 2026-03-09 12:33 UTC

📝 Description

Executive Summary

Running Gonka nodes currently requires manual CLI-based installation, configuration, and ongoing maintenance. For experienced operators, initial setup typically takes several hours per node. For less experienced participants, the process often stretches to one or two days — or results in abandonment before a node is ever launched.

This friction actively filters out potential operators, concentrating power among technically advanced users and slowing organic decentralization.

Gonka Node Manager requests $80,000 USD equivalent from the Community Pool to build a production-ready MVP that automates node deployment, updates, and monitoring.

Note: On-chain governance vote (Community Pool spend) will be submitted as a separate transaction once the contract is deployed. This issue tracks the technical proposal and implementation plan.

Problem Statement

Current node operation is characterized by:
- Manual CLI setup taking hours to days per node
- SSH access and server-level configuration for every update
- No automated monitoring or health visibility
- No streamlined process for attaching ML nodes to network nodes

This effectively limits participation to a narrow group of technically advanced users, concentrating operational power and slowing organic decentralization.

Proposed Solution

Gonka Node Manager — a unified tool that allows node operators to:

Deploy network and ML nodes in a consistent, automated way
Keep nodes up to date without manual intervention
Attach and operate ML nodes alongside network nodes
Observe basic node status and health without logging into servers

The system operates on top of user-provided infrastructure (physical or virtual) and fits naturally into a decentralized network model.

Architecture Overview

Components

Component	Role
Control Plane (Web UI + API)	User auth, node management, issues tasks to agents, displays status
Node Agent (host daemon)	Installed once per host, manages local Docker Compose stacks, communicates outbound only
Stack Bundles	Deployment definitions: services, ports, health checks, references to official Gonka images

Core User Flows

Node bootstrap — user adds node in UI → agent installed once on server → node available
Deployment & updates — automated via preconfigured stable release path, no user intervention
Network/ML attachment — ML node attached to network node, firewall rules applied automatically, connectivity validated
Warm key workflow — warm key generated locally on node, only public key exposed, cold-key signing stays with operator

Security Model

SSH access stays entirely under operator control — Control Plane never stores credentials or uses SSH
No inbound management ports required on nodes
Agent executes only approved, predefined operations — no arbitrary remote commands
Stack bundles verified before application
Cold keys never leave the operator's device

This preserves node sovereignty and aligns with Gonka's decentralization principles.
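
The "approved, predefined operations" rule can be pictured as an allowlist inside the agent. The sketch below is only an illustration under stated assumptions: the operation names, the task shape, and the handler functions are hypothetical and do not come from the proposal.

```python
from typing import Callable, Dict

# Hypothetical handlers; the proposal does not name concrete operations.
def deploy_stack(params: dict) -> str:
    return f"deploy {params['bundle']}"

def report_health(params: dict) -> str:
    return "health ok"

# Only operations registered here can ever run on the host.
ALLOWED_OPERATIONS: Dict[str, Callable[[dict], str]] = {
    "deploy_stack": deploy_stack,
    "report_health": report_health,
}

class OperationNotAllowed(Exception):
    """Raised when a task names an operation outside the allowlist."""

def handle_task(task: dict) -> str:
    """Run a task fetched from the Control Plane if it is predefined."""
    name = task.get("operation")
    handler = ALLOWED_OPERATIONS.get(name)
    if handler is None:
        raise OperationNotAllowed(f"operation {name!r} is not predefined")
    return handler(task.get("params", {}))
```

Because the agent looks tasks up by name and never evaluates free-form input, a request such as `{"operation": "shell"}` is rejected before anything executes.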

MVP Scope

Included

Host-level node agent
Deployment of network and ML nodes
Automatic updates via preconfigured stable release path
Network/ML attachment with firewall automation
Warm key generation and binding workflow
Basic status and health reporting

Explicitly Out of Scope

Advanced metrics and observability stacks
Arbitrary remote command execution
Manual version selection or multiple update channels
Role-based access control
Automated rollback mechanisms (except those recommended by Gonka developers)
On-chain enforcement of updates
Remote administrative access

Implementation Plan

Phase 1: MVP Installation (45–50% of budget)

Outcome: Network and ML nodes can be deployed and operate end-to-end
- Node agent (deploy, attach, health): 160–200h
- Control Plane core APIs: 80–100h
- Architecture & core design: 40–60h
- Minimal UI: 40–60h
- Integration testing: 40–60h

Phase 2: MVP Auto-Updates (20–25% of budget)

Outcome: Running nodes update themselves automatically
- Bundle versioning & signatures: 40–60h
- Agent update logic: 40–60h
- Control Plane update coordination: 20–30h
- Update testing: 20–30h

Phase 3: MVP Monitoring (15–20% of budget)

Outcome: Operators can observe node status and health
- Agent health reporting: 30–40h
- Backend status aggregation: 20–30h
- UI status views: 40–50h

Phase 4: Stabilization & Technical Debt (10–15% of budget)

Outcome: Stable, documented, production-ready MVP
- Bug fixes and edge cases
- Security hardening
- Documentation
- Final testing

Total estimated effort: 650–880 engineering hours
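
As a cross-check of the total, the itemized hours for Phases 1 to 3 can be summed directly. The short script below uses only the figures listed in this plan; reading the remainder as the Phase 4 allowance is an inference, since Phase 4 carries no hour estimates of its own.

```python
# (low, high) hour estimates copied from the Phase 1-3 line items.
PHASES = {
    "Phase 1": [(160, 200), (80, 100), (40, 60), (40, 60), (40, 60)],
    "Phase 2": [(40, 60), (40, 60), (20, 30), (20, 30)],
    "Phase 3": [(30, 40), (20, 30), (40, 50)],
}
TOTAL = (650, 880)  # stated total engineering hours

def phase_range(items):
    """Sum the low and high ends of a list of (low, high) estimates."""
    return (sum(lo for lo, _ in items), sum(hi for _, hi in items))

per_phase = [phase_range(items) for items in PHASES.values()]
itemized = tuple(sum(column) for column in zip(*per_phase))
remainder = (TOTAL[0] - itemized[0], TOTAL[1] - itemized[1])

print(itemized)   # (570, 780)
print(remainder)  # (80, 100)
```

Under that reading, roughly 80–100 hours of the stated total are left for stabilization work.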

Budget

Requested: $80,000 USD equivalent (one-time, from Community Pool)

Funding released in stages tied to phase completion. Cost overruns covered by the team at its own expense — no retroactive compensation requests will be made.

Accountability

All progress tracked in a single public GitHub repository
progress.md updated after each phase with written summary, deliverables, and artifact links
Code, docs, release notes, and demo materials committed per phase
Repository URL announced immediately after proposal approval

Expected Outcome

Easier onboarding for new node operators
Increased number of independent nodes
Reduced downtime caused by manual configuration
Stronger and more sustainable decentralization of the Gonka network

💬 Comments (4)

Comment 1 by @Aktum1

2026-02-27 22:43 UTC

Hey! Have you checked out what these guys have already built in this direction? https://gonka.gg/node-setup

What do you think about their implementation?

↳ Reply from @ochenUmnayaKatyshka · 2026-03-04 13:10 UTC

Our proposal focuses on a different set of operational requirements:

Lifecycle Management: Beyond the initial install, we are building a system for automated updates and configuration changes without requiring manual SSH sessions for every release.

Centralized Visibility: The goal is to provide a dashboard where the status of both Network and ML nodes can be monitored in one place, rather than checking individual server logs.

Orchestration Logic: We are automating the connectivity and firewall rules needed to attach ML nodes to Network nodes, which currently involves several manual steps.

Essentially, while a script handles the deployment process, our tool is designed to manage the node's long-term operation and connectivity automatically.

Comment 2 by @Aktum1

2026-02-27 22:52 UTC

Also, take a look at this node monitoring Telegram bot: @GonkaHubBot.

↳ Reply from @Aktum1 · 2026-02-27 22:59 UTC

I’ve been trying to set up my node for a month now.

It’s not exactly a simple setup — 8× A100 40GB. After one of the recent code changes, these GPUs were practically restricted. But I’ve already paid for the server for a month, so now I’m using it to fully test and fine-tune the node deployment process.

I even got a paid Cursor subscription specifically for this. Really hoping I’ll manage to get everything running properly.

If I succeed, I’ll definitely share the full experience with others.

That said, I’m not entirely sure whether node setup can realistically be turned into a “one-click service.” There are so many edge cases and nuances. Can a service really account for all of that?

To me, a model based on an AI agent that deeply understands the infrastructure and has access to all up-to-date documentation sounds more scalable than building a rigid setup service.

But I could be wrong.

At the end of the day, we’re basically trying to solve the same problem — just using different tools.

By the way, I’m documenting my entire journey in the Knowledge Base: https://gonka-data-base.gitbook.io/gonka-data-base-en/hosts/node.-testing

Maybe some of it could be useful for your project.

↳ Reply from @ochenUmnayaKatyshka · 2026-03-04 13:10 UTC

The 'one-click' challenge is exactly why our proposal moves away from simple scripts toward a Node Agent architecture.

A static setup often fails when hardware or dependencies change. Our approach uses the Agent as a local management layer to handle the environment, Docker stacks, and connectivity. The goal of this architecture is to provide a system that accounts for infrastructure nuances and maintains node operation through updates, rather than just performing an initial installation.

The project is designed to automate these manual processes and handle the edge cases that typically arise during long-term node maintenance.

↳ Reply from @Mayveskii · 2026-03-05 12:58 UTC

The 'one-click' challenge is exactly why our proposal moves away from simple scripts toward a Node Agent architecture.

A static setup often fails when hardware or dependencies change. Our approach uses the Agent as a local management layer to handle the environment, Docker stacks, and connectivity. The goal of this architecture is to provide a system that accounts for infrastructure nuances and maintains node operation through updates, rather than just performing an initial installation.

The project is designed to automate these manual processes and handle the edge cases that typically arise during long-term node maintenance.

Unified operator stack: delivery, control, reproducibility

Combining Node Manager (#816) and Prometheus exporter (#840) with a thin orchestration layer (e.g. Airflow, optionally n8n) gives a single stack that is not that heavy but covers the main operator needs: delivery, control, and reproducibility.

Delivery: Deploy and update nodes via Node Manager; no need to hand-hold each host.

Control: Same metrics (block height, POC weight, status) for everyone via the exporter + Prometheus + Grafana, so hosts see what we see.

Reproducibility: Airflow turns procedures into DAGs: health checks, update windows, report generation, cleanup, even ticket/workflow-style steps (e.g. “after alert → create task → run remediation”). Any scenario you can script becomes repeatable and auditable.

So you get one place for “how we deploy”, “how we monitor”, and “how we react and rerun”. The stack is modular: minimal is exporter + Prometheus + Grafana; Node Manager and Airflow (and optionally n8n for event-driven bits) add on top for those who want automation and reproducibility.

Worth doing? Yes. It directly tackles “hosts can’t easily configure and monitor like we do”: same tooling, same visibility, and the same ability to encode and replay scenarios instead of ad‑hoc SSH and one-off scripts. If the community wants to invest in operator experience and reproducibility, this is a concrete way to get there.

↳ Reply from @Mayveskii · 2026-03-06 14:23 UTC

One concrete number for why model-specialization deployment matters for the cache layer in PR #859:

hit_rate = repeat_fraction × (1/M) × (1 − stream_fraction)

M=571 (Qwen3-32B, shared across 571 nodes): hit_rate = 0.000473
M=1 (unique model per node via Node Manager): hit_rate = 0.270

Specialization multiplier: 571× improvement in cache hit rate. At M=1, MaxWeightFractionBps cap (+30% epoch weight) is reached.
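
The two figures can be reproduced from the formula with a few lines of Python. The individual values of `repeat_fraction` and `stream_fraction` are not given in the comment, only their combined effect (0.270 at M=1), so the inputs below are assumed values chosen to match that product.

```python
def hit_rate(repeat_fraction: float, m: int, stream_fraction: float) -> float:
    """hit_rate = repeat_fraction * (1/M) * (1 - stream_fraction)."""
    return repeat_fraction * (1.0 / m) * (1.0 - stream_fraction)

# Assumed inputs: any pair with repeat_fraction * (1 - stream_fraction) = 0.27
# gives the same results; the comment does not state them separately.
REPEAT_FRACTION = 0.30
STREAM_FRACTION = 0.10

shared = hit_rate(REPEAT_FRACTION, 571, STREAM_FRACTION)
specialized = hit_rate(REPEAT_FRACTION, 1, STREAM_FRACTION)

print(f"{shared:.6f}")              # 0.000473
print(f"{specialized:.3f}")         # 0.270
print(round(specialized / shared))  # 571
```

The ratio equals M regardless of the assumed inputs, which is where the 571× multiplier comes from.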

Node Manager's "one model per node" deployment pattern is the infrastructure prerequisite for the cache economics to work at network scale. The two proposals are complementary: Node Manager creates the conditions, semantic cache captures the reward.

Data source: live network topology, gonka.gg/api/public (1,282 ML nodes, 5 models measured, epoch 190).

Comment 3 by @andrey055

2026-03-01 17:47 UTC

In the Telegram discussion, some people are saying that this improvement may not be very relevant right now. It seems there are more pressing tasks at the moment, especially since alternative solutions already exist. However, we truly wouldn’t like to lose your willingness to build projects for Gonka.

Perhaps you could consider taking a look at a more relevant task?

↳ Reply from @ochenUmnayaKatyshka · 2026-03-04 12:36 UTC

You mentioned an existing list of tasks. We would be happy to take a look — could you please point us to where we can find it?

↳ Reply from @tcharchian · 2026-03-04 18:03 UTC

You mentioned an existing list of tasks. We would be happy to take a look — could you please point us to where we can find it?

We’ve recently published three new proposals on GitHub Discussions, and we’d really appreciate any feedback:

https://github.com/gonka-ai/gonka/discussions/801

https://github.com/gonka-ai/gonka/discussions/800

https://github.com/gonka-ai/gonka/discussions/802 (for this proposal, I also opened an issue where design and implementation ideas can be discussed https://github.com/gonka-ai/gonka/issues/821)

If you have thoughts, concerns, or improvement ideas, please share them directly in the comments

Also, I opened an additional discussion to highlight three key areas where help is needed https://github.com/gonka-ai/gonka/discussions/817. These are not yet framed as standalone tasks, but they point to three problems where your involvement, perspective, and suggestions would be extremely valuable.

https://github.com/gonka-ai/gonka/issues/818

https://github.com/gonka-ai/gonka/issues/820

https://github.com/gonka-ai/gonka/issues/819

Also, you can filter issues with "up-for-grabs" label https://github.com/gonka-ai/gonka/issues?q=is%3Aissue%20state%3Aopen%20label%3Aup-for-grabs and see if there are any open tasks that have no assignee.

More on bounty program: https://gonka.ai/FAQ/#bounty-program

Comment 4 by @SegovChik

2026-03-09 11:41 UTC

Here is a working solution with some limitations (it is still in development, so current functionality is limited to provisioning a network node and an ML node on the same server): https://github.com/inc4/gonka-nop/releases/tag/v0.1.8-rc1. Download the binary, run ./gonka-nop setup, and you are done.

↳ Reply from @SegovChik · 2026-03-09 11:44 UTC

And here is our roadmap, showing what has been done so far.

Implementation Milestones

Milestone 1: Core CLI Framework ✅ COMPLETE

1.1 Project Scaffolding

Initialize go.mod with dependencies

Create cmd/gonka-nop/main.go entry point

Create internal/cmd/root.go with cobra root command

Add version, status, reset, gpu-info command stubs

Create internal/cmd/setup.go (setup command with flags)

1.2 UI Utilities (needed for phases)

Create internal/ui/output.go (colored messages: Info, Success, Warn, Error)

Create internal/ui/spinner.go (progress spinner wrapper)

Create internal/ui/prompt.go (survey wrapper for interactive prompts)

1.3 State Management

Create internal/config/state.go (State struct)

Implement Save() and Load() methods

1.4 Phase System

Create internal/phases/phase.go (Phase interface + runner)

Create mocked phases for demo:

01_prerequisites.go (Docker, CUDA check - mocked)

02_gpu_detection.go (GPU detection - mocked)

03_network_select.go (network selection prompt)

04_key_management.go (key workflow - mocked)

05_config_generation.go (generate configs - mocked)

06_deploy.go (docker compose - mocked)

1.5 Integration & Demo

Wire setup command to phase runner

Test full CLI compilation (go build ./...)

Run demo: gonka-nop setup (shows full mocked flow)

1.6 Status Command

Create internal/status/status.go (NodeStatus struct, fetch functions)

Create internal/status/display.go (formatted output)

Implement 3 sections: Overview, Blockchain, MLNode

Support --mocked flag for demo

Implement gpu-info command with --mocked

Milestone 2: Test Coverage (Tests First) 🔄 IN PROGRESS

Write tests for existing code before adding new features. Target: 70%+ coverage.

2.1 Config Package Tests (target: 90%+)

state_test.go - NewState, Save, Load, Reset, MarkPhaseComplete

Add edge case tests (invalid JSON, permission errors)

2.2 Phases Package Tests (target: 80%+)

gpu_detection_test.go - recommendConfig, FormatGPUSummary

prerequisites_test.go - Mock docker/nvidia checks

config_generation_test.go - Template output validation

phase_runner_test.go - Phase execution flow

2.3 Status Package Tests (target: 70%+)

status_test.go - Mock HTTP responses, parse validation

display_test.go - Output formatting

2.4 UI Package Tests (target: 50%+)

output_test.go - Color formatting

spinner_test.go - Basic functionality

prompt_test.go - Mock survey responses (harder to test)

2.5 Integration Tests

cmd/setup integration test with mocked phases

End-to-end state persistence test

Milestone 3: GPU & Prerequisites (EXPANDED per chat analysis) ✅ MOSTLY COMPLETE

Real system detection replacing mocked implementations. Validated on mainnet (8x A100 SXM4 80GB).

3.1 GPU Detection (real nvidia-smi) ✅ COMPLETE

Parse nvidia-smi --query-gpu=index,name,memory.total,driver_version,pci.bus_id --format=csv

Detect GPU architecture (sm_80=A100, sm_86=A40/RTX3090, sm_89=L40/RTX4090, sm_90=H100/H200, sm_100=B200/B300, sm_120=RTX5090) — GPUArchFromName() in gpu_parser.go

NVLink/PCIe topology detection — name-based (H100/H200/A100 = NVLink, others = PCIe). Full nvidia-smi topo -m not yet implemented.

PCIe bandwidth warning for multi-GPU setups without NVLink

Auto-select MLNode image tag based on architecture (standard / blackwell) — selectMLNodeImage()
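
The detection steps in this list can be sketched as follows. This is a Python illustration, not the gonka-nop code (which is Go): the function names mirror the ones mentioned above but are hypothetical here, the sample row and its driver version are made up, and treating sm_100 and sm_120 as the "blackwell" image is an assumption based on those GPUs being Blackwell parts.

```python
import csv
import io

# Name fragments mapped to compute architecture, following the list above.
ARCH_BY_NAME = [
    ("A100", "sm_80"), ("A40", "sm_86"), ("3090", "sm_86"),
    ("L40", "sm_89"), ("4090", "sm_89"),
    ("H100", "sm_90"), ("H200", "sm_90"),
    ("B200", "sm_100"), ("B300", "sm_100"),
    ("5090", "sm_120"),
]
NVLINK_FAMILIES = ("H100", "H200", "A100")  # name-based, as described above

def gpu_arch_from_name(name: str) -> str:
    for fragment, arch in ARCH_BY_NAME:
        if fragment in name:
            return arch
    return "unknown"

def select_mlnode_image(arch: str) -> str:
    # Assumption: only Blackwell architectures need the "blackwell" image.
    return "blackwell" if arch in ("sm_100", "sm_120") else "standard"

def parse_nvidia_smi_csv(text: str) -> list:
    """Parse `nvidia-smi --query-gpu=... --format=csv` output (with header)."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row
    gpus = []
    for row in reader:
        if not row:
            continue
        index, name, memory, driver, bus_id = [cell.strip() for cell in row]
        gpus.append({
            "index": int(index),
            "name": name,
            "memory_mib": int(memory.split()[0]),  # "81920 MiB" -> 81920
            "driver": driver,
            "bus_id": bus_id,
            "arch": gpu_arch_from_name(name),
            "interconnect": "NVLink" if any(f in name for f in NVLINK_FAMILIES) else "PCIe",
        })
    return gpus

SAMPLE = (
    "index, name, memory.total [MiB], driver_version, pci.bus_id\n"
    "0, NVIDIA A100-SXM4-80GB, 81920 MiB, 570.124.06, 00000000:07:00.0\n"
)
print(parse_nvidia_smi_csv(SAMPLE)[0]["arch"])  # sm_80
```

Substring matching is crude (a PCIe A100 would also be reported as NVLink), which is consistent with the note above that full `nvidia-smi topo -m` detection is not yet implemented.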

3.2 Docker & Runtime Checks ✅ COMPLETE

Docker availability and version check — ParseDockerVersion() in gpu_parser.go

NVIDIA Container Toolkit detection (nvidia-ctk --version)

CUDA verification inside Docker container (docker run --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 nvidia-smi) — checkCUDAInDocker()

Docker Compose v2 check (docker compose version) — ParseDockerComposeVersion()

3.3 NVIDIA Driver Installation & Validation ✅ MOSTLY COMPLETE

Detect driver state: not installed / installed / version mismatch — checkNVIDIADriver()

Pre-check safety: Secure Boot status, kernel headers available, distro compatibility

Offer auto-install with user confirmation — installNVIDIADriver() via apt

Install Fabric Manager (nvidia-fabricmanager-570) if multi-GPU NVLink detected — checkFabricManager() with auto-start + install prompt

Compare userspace lib version vs kernel module version vs Fabric Manager version — 3-way consistency check in checkDriverConsistency()

Detect unattended-upgrades package and warn about NVIDIA driver auto-updates — suggests apt-mark hold nvidia-driver-*

Verify CUDA version compatibility with detected driver

Warn about reboot requirement after kernel module install/update

3.4 System Prerequisites ✅ MOSTLY COMPLETE

Linux distro detection (Ubuntu, Debian, CentOS, Amazon Linux) — ParseOSRelease() in gpu_parser.go

Disk space pre-check (warn if < 250 GB free) — ParseDiskFreeGB(), ParseLsblkJSON()

NVIDIA Container Toolkit auto-installation (with user confirmation)

Docker runtime configuration (nvidia-ctk runtime configure --runtime=docker)

Port availability check (5000, 8000, 26657 external; 5050, 8080, 9100, 9200 internal)
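
A disk pre-check of the kind listed above can be very small. The sketch below is a Python illustration with hypothetical function names; counting a gigabyte as 10^9 bytes is an assumption, since the roadmap does not say whether GB or GiB is meant.

```python
import shutil

MIN_FREE_GB = 250  # threshold from the pre-check above

def free_gb(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9

def disk_warning(free: float, minimum: float = MIN_FREE_GB):
    """Return a warning string when free space is below the threshold."""
    if free < minimum:
        return f"warning: only {free:.0f} GB free, {minimum} GB recommended"
    return None

message = disk_warning(free_gb("."))
if message:
    print(message)
```

Keeping the threshold logic separate from the measurement makes the warning easy to test without touching a real disk.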

Milestone 4: Configuration (EXPANDED per chat analysis) ✅ MOSTLY COMPLETE

Generate production-ready configs with security and performance defaults learned from validators. Validated on mainnet deploy.

4.1 TP/PP Recommendation Algorithm ✅ COMPLETE

Model-aware TP/PP calculation (Qwen3-235B, Qwen3-32B, QwQ-32B) — recommendConfig() with 4-tier VRAM logic

gpu-memory-utilization recommendation: 0.88-0.94 based on VRAM headroom (NOT 0.99)

max-model-len calculation based on remaining VRAM after model loading

Auto-add --kv-cache-dtype fp8 for tight VRAM configs (e.g., 8x A100 40GB with 235B)

PP not set in vLLM args — MLNode runner auto-calculates from GPUs/TP. No --quantization fp8 either (causes MoE alignment errors)

4.2 node-config.json Generation ✅ COMPLETE

Template-based generation with go:embed

host field validation (no http:// prefix -- common mistake from chat)

hardware field auto-populated from GPU detection

max_concurrent recommendation based on GPU config

Model selection with explicit declaration (required for PoC v2)
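
The host field rule above (no http:// prefix) amounts to a one-function validator. This is a hedged Python sketch with a hypothetical function name; rejecting every scheme, not just http://, is an assumption that goes slightly beyond what the roadmap states.

```python
def validate_host(value: str) -> str:
    """Return the trimmed host, rejecting empty values and scheme prefixes."""
    host = value.strip()
    if not host:
        raise ValueError("host must not be empty")
    if "://" in host:
        raise ValueError(f"host must not include a scheme prefix: {value!r}")
    return host

print(validate_host("10.0.0.5"))  # 10.0.0.5
```

Failing at config-generation time surfaces the mistake before the node starts with an unusable address.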

4.3 config.env Generation ✅ COMPLETE

Template-based with validated fields

MODEL_NAME environment variable (mandatory after v0.2.8)

VLLM_ATTENTION_BACKEND = FLASHINFER for ALL architectures (3.0.12+ standard, old FLASH_ATTN rule obsolete)

4.4 docker-compose.yml Generation ✅ COMPLETE

Template-based with go:embed

Port security: bind internal ports to 127.0.0.1 (5050, 8080, 9100, 9200)

Public ports: 5000 (P2P), 8000 (API via proxy), 26657 (RPC -- via proxy only after DDoS hardening)

MLNode image tag selection based on GPU architecture detection
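
The port-security rule can be shown as a small helper that renders Docker Compose `ports:` entries. It is a sketch under assumptions: the helper name is hypothetical, host and container ports are assumed equal, and 26657 is deliberately left out because the roadmap says RPC is reachable only through the proxy.

```python
INTERNAL_PORTS = (5050, 8080, 9100, 9200)  # must stay on loopback
PUBLIC_PORTS = (5000, 8000)                # P2P and API via proxy

def port_binding(port: int) -> str:
    """Render a Compose `ports:` entry for the given port."""
    if port in INTERNAL_PORTS:
        return f"127.0.0.1:{port}:{port}"  # reachable from the host only
    if port in PUBLIC_PORTS:
        return f"{port}:{port}"            # published on all interfaces
    raise ValueError(f"port {port} is not part of the documented layout")

for p in (5000, 9100):
    print(port_binding(p))
# 5000:5000
# 127.0.0.1:9100:9100
```

Binding to 127.0.0.1 matters because ports that Docker publishes on all interfaces are not filtered by UFW, a point the firewall section of this roadmap also raises.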

4.5 DDoS Protection Defaults ✅ COMPLETE

Proxy service config: GONKA_API_BLOCKED_ROUTES=poc-batches training

Proxy service config: GONKA_API_EXEMPT_ROUTES=chat inference

Default: DISABLE_CHAIN_API=true, DISABLE_CHAIN_RPC=true, DISABLE_CHAIN_GRPC=true

RPC port (26657) accessible only via proxy, not directly exposed

4.6 Sync & Pruning Configuration ✅ COMPLETE

Auto-configure persistent_peers in config.toml with known-good peers

Generate app.toml with pruning: custom, keep-recent=1000, interval=100

Set GENESIS_SEEDS and SEED_API_URL to reliable endpoints

Enable state sync by default (SYNC_WITH_SNAPSHOTS=true)

Milestone 5: Deployment (EXPANDED per chat analysis) ✅ MOSTLY COMPLETE

Deploy containers with real orchestration, security hardening, and sync monitoring. Validated on mainnet + testnet.

5.1 Docker Compose Orchestration ✅ COMPLETE

Multi-compose file handling (transparent -f docker-compose.yml -f docker-compose.mlnode.yml)

Automatic source config.env + sudo -E for environment variable propagation — internal/docker/env.go + compose.go

Image pull with progress display — docker compose pull

Container startup with ordered dependencies — network node first, then ML node

5.2 Firewall Configuration ⚠️ (from chat: Docker bypasses UFW)

Detect Docker's iptables behavior

Configure DOCKER-USER chain rules:

Allow established connections

Allow public ports (5000, 8000)

Whitelist known Gonka seed peers

DROP all other inbound to Docker containers

Offer iptables-persistent integration for rule persistence

IPv4/IPv6 resolution check for vLLM health endpoint (prevent restart loop)
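
One way to picture the DOCKER-USER rule set above is a generator that emits the iptables commands in order. This is a sketch, not the gonka-nop implementation: the interface name and peer addresses are placeholders, and matching the original destination port with the conntrack module is one common approach for this chain, assumed here rather than taken from the roadmap.

```python
def docker_user_rules(ext_if: str, public_ports, seed_peers) -> list:
    """Build ordered iptables commands for the DOCKER-USER chain."""
    rules = [
        # 1. Replies to connections that are already established.
        f"-i {ext_if} -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT",
    ]
    # 2. Public ports, matched on the port the client originally dialed.
    for port in public_ports:
        rules.append(
            f"-i {ext_if} -p tcp -m conntrack --ctorigdstport {port} -j ACCEPT"
        )
    # 3. Whitelisted seed peers.
    for peer in seed_peers:
        rules.append(f"-i {ext_if} -s {peer} -j ACCEPT")
    # 4. Everything else arriving on the external interface is dropped.
    rules.append(f"-i {ext_if} -j DROP")
    # Insert at explicit positions so the rules precede the chain's default RETURN.
    return [
        f"iptables -I DOCKER-USER {pos} {rule}"
        for pos, rule in enumerate(rules, start=1)
    ]

# Placeholder interface and documentation-range address, for illustration only.
for cmd in docker_user_rules("eth0", [5000, 8000], ["203.0.113.10"]):
    print(cmd)
```

The commands are only printed here; applying them requires root, and they would still need iptables-persistent or similar to survive a reboot, as the list notes.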

5.3 Key Management (both workflows) ✅ COMPLETE

Quick workflow: generate all keys on server — 04_key_management.go

Secure workflow: accept account pubkey, generate consensus + ML keys — --account-pubkey flag

grant-ml-ops-permissions automation — 07_registration.go

Key backup guidance

5.4 Model Weight Download ✅ COMPLETE

HuggingFace model download with progress bar — standalone gonka-nop download-model command + deploy phase

Resume support for interrupted downloads — uses docker run with HF_HOME mount

SHA256 verification after download

Pre-download into HF_HOME before container startup

5.5 Health Checks & Sync Monitoring ✅ COMPLETE

Container health verification (all services running) — polls /admin/v1/setup/report

Blockchain sync progress with block lag display — polls Tendermint RPC /status, shows block height progress, 30min timeout

Wait for sync completion before registration (configurable timeout)

Port accessibility check (external ports reachable) — PUBLIC_URL reachability check planned
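
The sync check above polls the Tendermint RPC `/status` endpoint. A minimal Python sketch of the parsing step follows; it relies on the standard `result.sync_info` fields of that response, while the function names and the idea of comparing against a separately obtained network height are assumptions for illustration.

```python
import json

def parse_sync_status(body: str) -> dict:
    """Extract height and catching_up from a Tendermint RPC /status response."""
    info = json.loads(body)["result"]["sync_info"]
    return {
        "height": int(info["latest_block_height"]),  # the RPC returns a string
        "catching_up": bool(info["catching_up"]),
    }

def blocks_behind(local_height: int, network_height: int) -> int:
    """Block lag against a reference height obtained elsewhere (assumption)."""
    return max(network_height - local_height, 0)

# Trimmed sample response with made-up numbers.
SAMPLE = json.dumps({
    "result": {"sync_info": {"latest_block_height": "12000", "catching_up": True}}
})
status = parse_sync_status(SAMPLE)
print(status["height"], status["catching_up"], blocks_behind(status["height"], 12500))
# 12000 True 500
```

A polling loop would call this until `catching_up` is false or the configured timeout expires.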

Milestone 6: Operations ⚠️ NEW (from chat: day-2 is 90% of operator time)

Post-deployment commands for ongoing node management. This entire milestone was missing from the original plan and is driven by validator chat analysis showing operators spend most time on operations, not setup.

6.1 gonka-nop status (real implementation) ✅ COMPLETE

Blockchain: block height, sync status, blocks behind, catching_up flag

Epoch: current epoch, participation status, weight, miss rate

Epoch: PoC weight, timeslot allocation, inference count, upcoming epoch, reward claim status

MLNode: model loaded, GPU utilization, PoC status, intended vs current mismatch, status freshness

Node Config: public URL, PoC callback URL, seed API, API version, height lag, upgrade plan

Security: cold key, warm key, ML permissions

Containers: running/stopped/unhealthy state inferred from setup/report checks

PUBLIC_URL reachability check — HTTP GET <public_url>/health, report PASS/FAIL with error details. Critical: port mismatch or wrong registered URL = validators can't verify PoC proofs = node never gains weight. Learned from testnet debugging (2026-02-11).

Network: peer count (blocked — Tendermint RPC 26657 not exposed to host on standard deployments)

Miss rate timeline (green/red dot visualization)

6.2 gonka-nop update (safe rollout) ✅ COMPLETE (implemented as M9.2)

Check timeslot_allocation via Admin API to find safe window

Disable ML node via POST /admin/v1/nodes/:id/disable

Pull new container image with progress

Update image tag in docker-compose file

Recreate container (--no-deps --force-recreate)

Wait for model load completion (monitor logs)

Re-enable ML node via POST /admin/v1/nodes/:id/enable

Verify health after update

Distinguish auto-update (Cosmovisor: node, api) vs manual (mlnode, proxy)

6.3 gonka-nop reset (blockchain data cleanup)

Preserve keys (tmkms, account, ML) and config files

Run inferenced tendermint unsafe-reset-all --keep-addr-book inside container

Remove upgrade-info.json and cosmovisor/ directory

Restart node container

Monitor sync progress after reset

6.4 gonka-nop cleanup (disk space recovery)

Calculate current disk usage (.inference/data/, cosmovisor backups)

Remove old Cosmovisor backup directories

Report freed space

Recommend pruning settings if not configured

6.5 gonka-nop ml-node (Admin API wrapper) 🔄 PARTIAL

ml-node list - GET /admin/v1/nodes (status, allocation, model, hardware, PoC weight)

ml-node add - POST /admin/v1/nodes (interactive or from config)

ml-node update - PUT /admin/v1/nodes/:id (model, TP/PP, max_concurrent)

ml-node enable/disable - POST /admin/v1/nodes/:id/enable|disable

ml-node status - Detailed: host, ports, model+args, hardware, status/intended mismatch, epoch allocation, PoC weight, timeslots, status freshness

6.6 gonka-nop model (model management)

Switch model via Admin API (PUT /admin/v1/nodes/:id)

Show current model and next-epoch model

Warn that model changes apply next epoch only

Validate model compatibility with GPU config

6.7 Pre-upgrade Binary Download

Download inferenced and decentralized-api binaries from GitHub releases

SHA256 verification

Place in correct Cosmovisor upgrade directory

Verify permissions (chmod +x)
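
The verify-then-place steps above can be sketched as below. It is a Python illustration with hypothetical function names; the `upgrades/<name>/bin/` layout follows Cosmovisor's documented directory structure, and the download step itself is left out.

```python
import hashlib
import os
import shutil
from pathlib import Path

def sha256_of(path) -> str:
    """Stream the file through SHA-256 so large binaries are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def install_upgrade_binary(downloaded, expected_sha256: str,
                           cosmovisor_home, upgrade_name: str,
                           binary_name: str) -> Path:
    """Verify the checksum, then copy into the upgrade directory as executable."""
    actual = sha256_of(downloaded)
    if actual != expected_sha256.lower():
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")
    target_dir = Path(cosmovisor_home) / "upgrades" / upgrade_name / "bin"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / binary_name
    shutil.copy2(downloaded, target)
    target.chmod(0o755)  # equivalent of chmod +x for owner, group, others
    return target
```

Verifying before copying means a corrupted download never lands in a directory that Cosmovisor would execute from.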

Milestone 7: Registration & On-chain

7.1 Registration Flow ✅ COMPLETE

submit-new-participant with correct flags (validator-key, chain-id, node URL)

Auto-fetch consensus_pubkey from setup report API

grant-ml-ops-permissions for ML key

Validate PUBLIC_URL format (no http:// prefix)

7.2 PoC Verification

Test PoC endpoint (/api/v1/pow/init/generate)

Verify model loaded and responding

Check epoch participation after registration

7.3 Reward Management

gonka-nop claim-rewards - Simplified claim flow

Fetch seed from Admin API config

Force-claim via /admin/v1/claim-reward/recover for missed epochs

Show unclaimed reward history

7.4 Governance

gonka-nop vote - Simplified governance voting

Show active proposals

Vote with account key

Milestone 8: Advanced & Polish 🔄 IN PROGRESS

Multi-node batch management (update/status across 10+ nodes)

Miss rate timeline visualization (dot-based OK/missed display)

Cloud provider compatibility (Vast.ai port remapping, GCore bare metal)

Russian language support for error messages and prompts

Monitoring integration (Prometheus exporter + centralized push architecture) — see M8.1

Performance benchmarking integration (compressa-perf)

Self-update mechanism for gonka-nop binary

Documentation & release automation

8.1 Centralized Monitoring (Ansible) ✅ COMPLETE

Opt-in push-based monitoring for Gonka validators. Ansible-based deployment in ansible/ subdirectory.

Architecture: Exporter + Prometheus on validator → remote_write push → Central Prometheus + Grafana (operated by inc4). No inbound ports opened on validator nodes.

Components built:
- [x] gonka-exporter role: bundles votkon's exporter (28 metrics from 5 API endpoints), builds Docker image locally, deploys via compose
- [x] prometheus role: Prometheus with conditional remote_write, alert rules (11 rules in 2 groups), --web.enable-remote-write-receiver for central server
- [x] alertmanager role: conditional Telegram/Discord/Slack notifications, route-based severity
- [x] grafana role: auto-provisioned datasources + 2 dashboards (Fleet Overview, Node Deep Dive)
- [x] playbooks/deploy-all.yml: full stack for central server (exporter + prometheus + alertmanager + grafana)
- [x] playbooks/add-node.yml: add internal validator with remote_write to central
- [x] playbooks/client-deploy.yml: self-contained for external validators (hardcoded central URL, validates prerequisites, verifies remote_write works)
- [x] playbooks/client-teardown.yml: clean removal of monitoring from validator
- [x] inventory/client.yml.example: template for external operators
- [x] README.md: operator-focused documentation with architecture diagram, metrics table, alert rules, security guarantees
- [x] .gitignore: protects client inventory files from accidental commits

Dashboards:
- Fleet Overview (gonka-fleet-overview): multi-node overview with status, block lag, miss rate, PoC weight, GPU, earnings
- Node Deep Dive (gonka-node-deep-dive): per-node detail with $instance template variable

Alert rules (11):
- Critical: GonkaMissRateHigh (>20%), GonkaNodeFailed, GonkaBlockLagCritical (>200 blocks), GonkaNodeStopped, GonkaExporterDown
- Warning: GonkaBlockLag (>50), GonkaCatchingUp, GonkaGPUUtilizationHigh (>95%), GonkaZeroWeight, GonkaZeroInferences, GonkaStatusMismatch
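
The numeric thresholds in these rules can be expressed as a small classifier. The real rules are Prometheus alert definitions; the Python below only mirrors the four thresholds stated above, and reporting a single block-lag alert (critical taking precedence over warning) is a simplifying assumption.

```python
def evaluate_alerts(miss_rate_pct: float, block_lag: int, gpu_util_pct: float) -> list:
    """Return (severity, alert name) pairs for the threshold-based rules."""
    alerts = []
    if miss_rate_pct > 20:
        alerts.append(("critical", "GonkaMissRateHigh"))
    if block_lag > 200:
        alerts.append(("critical", "GonkaBlockLagCritical"))
    elif block_lag > 50:  # assumption: suppress the warning when critical fires
        alerts.append(("warning", "GonkaBlockLag"))
    if gpu_util_pct > 95:
        alerts.append(("warning", "GonkaGPUUtilizationHigh"))
    return alerts

print(evaluate_alerts(miss_rate_pct=25.0, block_lag=120, gpu_util_pct=80.0))
# [('critical', 'GonkaMissRateHigh'), ('warning', 'GonkaBlockLag')]
```

The remaining rules (node failed, exporter down, zero weight, and so on) depend on state rather than thresholds and are omitted here.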

Milestone 9: Version Management ✅ COMPLETE (3/4 tasks, pre-seeding deferred)

Dynamic image versions and safe upgrade handling. Addresses the "chicken and egg" problem where NOP hardcodes image versions that become stale after chain upgrades.

Source of truth: gonka-ai/gonka GitHub repository

- Mainnet: main branch → deploy/join/docker-compose.yml + docker-compose.mlnode.yml
- Testnet: testnet/main branch → same paths

9.1 Dynamic Image Version Fetching ✅ COMPLETE¶

ImageVersions struct with per-service tags (node, api, tmkms, proxy, bridge, mlnode, nginx)

FetchImageVersions() fetches from GitHub raw URLs at setup time

ParseComposeImageVersions() extracts tags from compose YAML (handles comments, bridge digest pins, proxy vs proxy-ssl)

Fallback to hardcoded versions when GitHub is unreachable

Network select phase fetches and populates state

Config generation uses per-service versions

12 tests covering parsing, fallback, disambiguation
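The parsing and fallback steps above can be sketched roughly as follows. This is not the project's `ParseComposeImageVersions()`; it is a simplified, line-based stand-in that assumes one `image:` entry per line. The function names, the `ghcr.io/example/...` image references, and the version numbers are all placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// parseComposeImageTags maps image name -> tag for every "image:" line.
// Hypothetical helper: a real parser would likely decode the YAML properly.
func parseComposeImageTags(compose string) map[string]string {
	tags := make(map[string]string)
	for _, line := range strings.Split(compose, "\n") {
		// Drop comments (whole-line or trailing) and surrounding whitespace.
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i]
		}
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "image:") {
			continue
		}
		ref := strings.Trim(strings.TrimSpace(strings.TrimPrefix(line, "image:")), `"'`)
		// Strip a digest pin such as "@sha256:..." before looking for the tag.
		if i := strings.Index(ref, "@"); i >= 0 {
			ref = ref[:i]
		}
		// Only look for the tag after the last slash, so that a registry
		// port ("host:5000/img") is not mistaken for a tag.
		name, tag := ref[strings.LastIndex(ref, "/")+1:], ""
		if i := strings.LastIndex(name, ":"); i >= 0 {
			name, tag = name[:i], name[i+1:]
		}
		if name != "" && tag != "" {
			tags[name] = tag // exact names keep "proxy" and "proxy-ssl" apart
		}
	}
	return tags
}

// withFallback starts from the hardcoded versions and overrides them with
// whatever was fetched, so an empty fetch result leaves the fallback intact.
func withFallback(fetched, fallback map[string]string) map[string]string {
	out := make(map[string]string, len(fallback))
	for k, v := range fallback {
		out[k] = v
	}
	for k, v := range fetched {
		out[k] = v
	}
	return out
}

func main() {
	compose := `services:
  proxy:
    image: ghcr.io/example/proxy:1.2.3 # pinned
  bridge:
    image: ghcr.io/example/bridge:0.9.0@sha256:abc123
`
	fallback := map[string]string{"proxy": "1.0.0", "mlnode": "3.0.0"}
	fmt.Println(withFallback(parseComposeImageTags(compose), fallback))
}
```

Keying the result by exact image name is one simple way to get the proxy vs proxy-ssl disambiguation mentioned above; the actual implementation may do this differently.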

9.2 gonka-nop update (safe version update) ✅ COMPLETE¶

Fetch latest versions from GitHub and compare with current compose files

Show diff of version changes before applying

For manual-update services (proxy, bridge, mlnode): update image tags in compose files

Safe MLNode rollout: check timeslot_allocation → disable → pull → recreate → wait model load → enable

For Cosmovisor-managed services (node, api): inform user these auto-update at chain upgrade block

--check flag to only show available updates without applying
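A rough sketch of the compare-and-show-diff step, under stated assumptions: versions are held as plain service-to-tag maps, the `change` type and `diffVersions` function are hypothetical, and the version numbers are placeholders. The split between manual-update and Cosmovisor-managed services follows the list above.

```go
package main

import (
	"fmt"
	"sort"
)

// change describes one service whose latest tag differs from the current one.
type change struct {
	Service string
	From    string
	To      string
	Manual  bool // true when the image tag must be edited in the compose file
}

// manualUpdate lists the services the text describes as manually updated;
// node and api are left to Cosmovisor.
var manualUpdate = map[string]bool{"proxy": true, "bridge": true, "mlnode": true}

// diffVersions returns the changed services, sorted by name for stable output.
func diffVersions(current, latest map[string]string) []change {
	var out []change
	for svc, to := range latest {
		if from := current[svc]; from != to {
			out = append(out, change{Service: svc, From: from, To: to, Manual: manualUpdate[svc]})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Service < out[j].Service })
	return out
}

func main() {
	// Placeholder version numbers, for illustration only.
	current := map[string]string{"node": "0.0.1", "proxy": "0.0.1", "mlnode": "0.0.1"}
	latest := map[string]string{"node": "0.0.2", "proxy": "0.0.1", "mlnode": "0.0.2"}

	for _, c := range diffVersions(current, latest) {
		action := "auto-updates at the chain upgrade block"
		if c.Manual {
			action = "update image tag in compose file"
		}
		fmt.Printf("%-7s %s -> %s (%s)\n", c.Service, c.From, c.To, action)
	}
}
```

In this shape, a `--check` run would print the diff and stop, while a normal run would go on to rewrite the compose files for the entries marked `Manual`.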

9.3 Cosmovisor Upgrade Pre-seeding (during deploy)¶

Query chain governance proposals via REST API (/cosmos/gov/v1/proposals)

Detect pending or applied upgrade plans

Pre-download inferenced and decentralized-api binaries to Cosmovisor upgrade dirs

SHA256 verification from proposal plan.info field

Prevents the node from getting stuck in a restart loop if it is offline during the upgrade block
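The SHA256 verification step could look roughly like this. It is a minimal sketch: `verifySHA256` is a hypothetical helper, and it hashes an in-memory byte slice for brevity, whereas a real downloader would more likely stream the file into the hash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifySHA256 compares the SHA-256 digest of data with the expected
// hex-encoded checksum, ignoring case and surrounding whitespace.
func verifySHA256(data []byte, wantHex string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	if !strings.EqualFold(got, strings.TrimSpace(wantHex)) {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

func main() {
	// Well-known SHA-256 test vector for the input "abc".
	const want = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	if err := verifySHA256([]byte("abc"), want); err != nil {
		fmt.Println("refusing to install binary:", err)
		return
	}
	fmt.Println("checksum verified")
}
```

The important design point is to verify before the binary is placed in the Cosmovisor upgrade directory, so a failed check never leaves a bad binary where the node would run it.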

9.4 gonka-nop repair (detect and fix stuck nodes) ✅ COMPLETE¶

Detect "upgrade handler is missing" restart loop in node container logs

Parse upgrade-info.json info field for on-chain binary URLs (source of truth for mainnet vs testnet repos)

Download and place binaries in correct Cosmovisor upgrade directory (with SHA256 verification)

Update current symlink to point to upgrade directory

Restart node container

Fix: prefer upgrade-info.json URLs over GitHub release search (prevents wrong binary on testnet)
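The detection and URL-parsing steps of the repair flow might be sketched as below. Two things are assumptions, not taken from the source: that the `info` field follows the common Cosmovisor convention of a JSON object with a `binaries` map keyed by platform, and that the checksum travels as a `?checksum=sha256:...` query parameter. The function names and the example URL are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// missingHandlerMarker is the log text the repair command looks for.
const missingHandlerMarker = "upgrade handler is missing"

// stuckOnUpgrade reports whether the container logs show the restart loop.
func stuckOnUpgrade(logs string) bool {
	return strings.Contains(logs, missingHandlerMarker)
}

// binaryRef is a download URL plus the checksum it must match.
type binaryRef struct {
	URL    string
	SHA256 string
}

// parseBinaryRef extracts the binary URL for one platform from the info
// field. ASSUMED format: {"binaries": {"<os>/<arch>": "<url>?checksum=sha256:<hex>"}}.
func parseBinaryRef(info, platform string) (binaryRef, error) {
	var parsed struct {
		Binaries map[string]string `json:"binaries"`
	}
	if err := json.Unmarshal([]byte(info), &parsed); err != nil {
		return binaryRef{}, fmt.Errorf("parse info field: %w", err)
	}
	raw, ok := parsed.Binaries[platform]
	if !ok {
		return binaryRef{}, fmt.Errorf("no binary listed for %s", platform)
	}
	u, err := url.Parse(raw)
	if err != nil {
		return binaryRef{}, fmt.Errorf("parse binary URL: %w", err)
	}
	q := u.Query()
	sum := strings.TrimPrefix(q.Get("checksum"), "sha256:")
	if sum == "" {
		return binaryRef{}, fmt.Errorf("no sha256 checksum for %s", platform)
	}
	// Return the URL without the checksum parameter.
	q.Del("checksum")
	u.RawQuery = q.Encode()
	return binaryRef{URL: u.String(), SHA256: sum}, nil
}

func main() {
	info := `{"binaries":{"linux/amd64":"https://example.com/inferenced.zip?checksum=sha256:abcd"}}`
	if stuckOnUpgrade("panic: upgrade handler is missing") {
		ref, err := parseBinaryRef(info, "linux/amd64")
		fmt.Println(ref, err)
	}
}
```

Reading the URL from the on-chain info rather than searching GitHub releases is what keeps a testnet node from picking up a mainnet binary, as the fix above notes.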

Milestone 10: Multi-MLNode Support (Separate Servers) — PLANNED¶

10.1 ml-node add/update/delete commands¶

ml-node add — POST /admin/v1/nodes (interactive + flag modes)

ml-node update <id> — PUT /admin/v1/nodes/:id

ml-node delete <id> — DELETE /admin/v1/nodes/:id
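The mapping from the three subcommands to Admin API calls can be sketched as follows. The HTTP methods and paths come from the list above; everything else is an assumption for illustration: the `newNodeRequest` helper, the `http://localhost:9200` base URL, the node ID, and the payload, whose real schema is not described here.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// newNodeRequest builds the Admin API request for one ml-node action.
// payload is passed through as opaque JSON.
func newNodeRequest(baseURL, action, id string, payload []byte) (*http.Request, error) {
	nodes := strings.TrimRight(baseURL, "/") + "/admin/v1/nodes"
	if action != "add" && id == "" {
		return nil, fmt.Errorf("%s requires a node id", action)
	}
	switch action {
	case "add":
		return jsonRequest(http.MethodPost, nodes, payload)
	case "update":
		return jsonRequest(http.MethodPut, nodes+"/"+url.PathEscape(id), payload)
	case "delete":
		return http.NewRequest(http.MethodDelete, nodes+"/"+url.PathEscape(id), nil)
	default:
		return nil, fmt.Errorf("unknown action %q", action)
	}
}

// jsonRequest creates a request with a JSON body and content type.
func jsonRequest(method, target string, payload []byte) (*http.Request, error) {
	req, err := http.NewRequest(method, target, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Hypothetical base URL, node ID, and payload.
	req, err := newNodeRequest("http://localhost:9200", "update", "node1", []byte(`{}`))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

The sketch only constructs the request; sending it (for example with `http.Client.Do`) and handling the response are left out.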

10.2 Setup type flag (--type full|network|mlnode)¶

--type network — chain node + API only, skip GPU/mlnode phases

--type mlnode --network-node URL — GPU server only, new mlnode-specific phases

--type full (default) — current behavior unchanged
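One possible shape for parsing and validating these flags, using Go's standard `flag` package. This is a sketch under assumptions: the `setupOptions` type, the `parseSetupFlags` function, and the flag set name are hypothetical, and treating `--network-node` as mandatory for `--type mlnode` is an assumption, since this milestone is still planned.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// setupOptions holds the parsed setup flags.
type setupOptions struct {
	Type        string
	NetworkNode string
}

// parseSetupFlags parses --type and --network-node and validates the
// combination. The standard flag package accepts both -type and --type.
func parseSetupFlags(args []string) (setupOptions, error) {
	var o setupOptions
	fs := flag.NewFlagSet("setup", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // return errors instead of printing usage
	fs.StringVar(&o.Type, "type", "full", "setup type: full|network|mlnode")
	fs.StringVar(&o.NetworkNode, "network-node", "", "network node URL (used with --type mlnode)")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	switch o.Type {
	case "full", "network":
		// No extra flags required.
	case "mlnode":
		if o.NetworkNode == "" {
			return o, fmt.Errorf("--type mlnode requires --network-node URL")
		}
	default:
		return o, fmt.Errorf("invalid --type %q (want full|network|mlnode)", o.Type)
	}
	return o, nil
}

func main() {
	opts, err := parseSetupFlags([]string{"--type", "mlnode", "--network-node", "http://10.0.0.1:9200"})
	fmt.Println(opts, err)
}
```

Defaulting to `full` keeps existing invocations working unchanged, which matches the "current behavior unchanged" note above.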

10.3 MLNode-only setup phases¶

08_mlnode_config.go — generate mlnode compose + nginx only

09_mlnode_deploy.go — docker compose up + register with network node via Admin API

PoC callback URL warning for multi-server setups
