🔄 Auto-sync: from Discussion #1334 every 6 hours.

Devshard E2E Test Automation Proposal¶

Author: @aikuznetsov · Category: Proposals · Created: 2026-06-10 19:23 UTC · Updated: 2026-06-12 13:36 UTC

📝 Description¶

Goal¶

Build a real integration test layer for devshard that runs from Go tests but validates the system across Docker containers, real HTTP networking, real process boundaries, and real storage.

This suite should complement the existing unit, package, and httptest tests. Those tests remain the fast correctness layer. The E2E suite verifies that the same protocol works when the pieces are started, wired, restarted, and failed like real services.

Scope¶

The test runner is Go. The runtime is Docker.

The suite should not depend on a live Cosmos chain, Testermint, or decentralized-api. Chain-facing metadata is served by a local mock service. Inference and validation use deterministic stub engines unless a scenario explicitly opts into a different backend.

Out of scope for the first version:

production observability stack validation
long-running performance or soak testing
live chain settlement submission
real ML model execution
full versiond governance flow

Those can be added later as separate profiles once the core protocol E2E layer is stable.

Test Tools And Frameworks¶

The first E2E implementation should keep the toolchain small and Go-native.

Recommended tools:

| Area | Tool | Role |
| --- | --- | --- |
| Test runner | Go `testing` | Owns scenario execution, assertions, setup, teardown, and CI integration. |
| Containers | `testcontainers-go` | Starts Docker networks, containers, exposed ports, volumes, and readiness checks from Go. |
| Assertions | `stretchr/testify` | Keeps E2E checks readable and consistent with existing devshard tests. |
| HTTP client | Go `net/http` plus devshard clients | Drives `devshardctl`, host transport routes, mock-chain controls, and diagnostic endpoints. |
| JSON handling | Standard `encoding/json` or existing devshard JSON helpers | Parses OpenAI-compatible responses, control responses, and settlement payloads. |
| Docker images | Explicit `make` targets | Builds images before tests run; individual tests select prebuilt images and fail fast if missing. |
| Database | Testcontainers Postgres module | Runs the smoke storage backend and deeper recovery scenarios. |
| Logs | Docker/testcontainers log capture | Collects container logs on test failure for diagnosis. |

Tools to avoid in the first version:

Python scenario runners: Go keeps the scenarios close to devshard types, signing helpers, storage helpers, and existing assertions. Adding Python would create a second test runtime before the E2E contract is stable.
Docker Compose as the primary test orchestrator: testcontainers-go gives each Go test direct control over networks, containers, ports, logs, restarts, and cleanup. Compose can still be useful later for manual reproduction.
live Cosmos chain or Testermint: the first suite should isolate devshard protocol and transport behavior from chain startup, block production, governance, and unrelated node failures. mock-chain covers the bridge contract needed by devshard.
real ML model execution: deterministic stub inference keeps tests fast, reproducible, and focused on protocol behavior rather than GPU/model availability or generation quality.
browser/UI automation: devshard E2E validates HTTP APIs and protocol state. Browser automation would add slow UI concerns that are not part of this proposal.

Test Environment Structure¶

Each test starts an isolated Docker network. The Go test process stays outside the network and controls the environment through Docker APIs and mapped service ports.

The default smoke environment should spin up:

one mock-chain container
three devshard-host-N containers
one devshardctl container
one postgres container

Storage and fault scenarios add containers or volumes as needed:

persistent SQLite volumes for restart tests
optional per-service control endpoints for deterministic fault injection

flowchart LR
    TestRunner["Go E2E test runner"]
    Docker["Docker / testcontainers-go"]
    Client["HTTP assertions"]

    subgraph Net["isolated Docker network"]
        MockChain["mock-chain\nchain metadata + control API"]
        DevshardCtl["devshardctl\nOpenAI-compatible API"]

        Host0["devshard-host-0\nslot 0"]
        Host1["devshard-host-1\nslot 1"]
        Host2["devshard-host-2\nslot 2"]

        Postgres["postgres\nsmoke storage backend"]
        Vol0[("host-0 SQLite volume")]
        Vol1[("host-1 SQLite volume")]
        Vol2[("host-2 SQLite volume")]
    end

    TestRunner --> Docker
    TestRunner --> Client
    Client --> DevshardCtl
    Client -.direct protocol checks.-> Host0
    Client -.direct protocol checks.-> Host1
    Client -.direct protocol checks.-> Host2

    DevshardCtl --> Host0
    DevshardCtl --> Host1
    DevshardCtl --> Host2

    Host0 <-->|gossip| Host1
    Host1 <-->|gossip| Host2
    Host2 <-->|gossip| Host0

    Host0 --> MockChain
    Host1 --> MockChain
    Host2 --> MockChain
    DevshardCtl --> MockChain

    Host0 --> Postgres
    Host1 --> Postgres
    Host2 --> Postgres

    Host0 -.sqlite profile.-> Vol0
    Host1 -.sqlite profile.-> Vol1
    Host2 -.sqlite profile.-> Vol2

Container inventory:

| Container | Required | Count | Purpose |
| --- | --- | --- | --- |
| `mock-chain` | yes | 1 | Serves escrow, participant, epoch, version, and warm-key metadata. Provides dev-only control APIs for metadata faults. |
| `devshard-host-N` | yes | 3 by default | Runs one real devshard participant per slot with transport, gossip, storage, signing, and stub engines. |
| `devshardctl` | yes for smoke | 1 | Exposes the OpenAI-compatible user API and drives the normal user-facing path. |
| `postgres` | yes | 1 | Provides the default smoke storage backend and production-like recovery coverage. |
| SQLite volumes | no | 1 per host | Preserve host-local state across container restarts in SQLite recovery scenarios. |

The first implementation should standardize on a three-host group because many protocol behaviors need a majority-like shape: executor rotation, timeout votes, signature accumulation, and gossip convergence. The harness can expose Hosts: N later for stress or edge-case tests.
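One way to keep the three-host default while leaving room for `Hosts: N` is a small option struct whose zero value means "use the default". This is only a sketch: `EnvConfig`, `defaultHosts`, and `hostNames` are hypothetical names, and the proposal fixes only the `devshard-host-N` naming and the default of three.

```go
package main

import (
	"errors"
	"fmt"
)

// EnvConfig is a hypothetical harness option set.
type EnvConfig struct {
	Hosts int // number of host containers; 0 selects the default
}

const defaultHosts = 3

// hostNames returns one container name per slot: devshard-host-0 .. devshard-host-(N-1).
func (c EnvConfig) hostNames() ([]string, error) {
	n := c.Hosts
	if n == 0 {
		n = defaultHosts
	}
	if n < 0 {
		return nil, errors.New("Hosts must not be negative")
	}
	names := make([]string, n)
	for slot := range names {
		names[slot] = fmt.Sprintf("devshard-host-%d", slot)
	}
	return names, nil
}

func main() {
	names, err := EnvConfig{}.hostNames()
	fmt.Println(names, err) // prints: [devshard-host-0 devshard-host-1 devshard-host-2] <nil>
}
```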

Runtime Services¶

Each E2E environment starts an isolated Docker network and a small set of services.

`mock-chain`¶

mock-chain is a local metadata service that implements the subset of mainnet bridge behavior needed by devshard.

The first implementation should match the current REST bridge shape exactly. That keeps E2E focused on validating the bridge contract devshard already uses instead of adding a second mock-only API. A cleaner internal control API can be added alongside the REST-compatible endpoints later, but protocol setup and recovery should continue to exercise the same paths as production code.

It serves deterministic local config for:

escrow ID
escrow creator address
escrow balance
epoch ID
app hash
host slot assignments
host inference URLs
token price
validation threshold
warm key grants
approved devshard versions, when a version scenario needs them

It should also expose a dev-only control API for test scenarios:

advance epoch
change approved versions
change host metadata
add or remove warm key grants
inject response delays
inject bridge errors

`devshard-host-N`¶

Each host container runs one participant. The process should use the real devshard host, transport, signing, storage, gossip, and state machine code.

Configurable inputs:

escrow ID
host signer key
user address
slot assignment
route prefix
peer host URLs
storage backend
mock-chain URL
stub inference behavior
stub validation behavior

The host should expose the standard devshard transport routes, mounted under either the legacy route prefix or a versioned prefix:

/v1/devshard/*
/devshard/<version>/*
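The two prefixes imply a mapping from request path to session version. A minimal sketch of that mapping follows, assuming the legacy prefix corresponds to version `v1` (as the legacy-route scenario in this proposal asserts). The `sessionVersion` helper and the example sub-paths are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// sessionVersion derives the session version from a transport route.
// It returns false for paths outside both devshard prefixes.
func sessionVersion(path string) (string, bool) {
	if strings.HasPrefix(path, "/v1/devshard/") {
		return "v1", true // legacy prefix
	}
	if !strings.HasPrefix(path, "/devshard/") {
		return "", false
	}
	rest := strings.TrimPrefix(path, "/devshard/")
	version, _, found := strings.Cut(rest, "/")
	if !found || version == "" {
		return "", false
	}
	return version, true
}

func main() {
	// "sessions" is a placeholder sub-route, not a documented devshard path.
	for _, p := range []string{"/v1/devshard/sessions", "/devshard/v2/sessions", "/healthz"} {
		v, ok := sessionVersion(p)
		fmt.Printf("%q -> %q %v\n", p, v, ok)
	}
}
```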

`devshardctl`¶

The suite should include scenarios that drive requests through the OpenAI-compatible devshardctl surface. This validates the user-facing path:

client -> devshardctl -> devshard transport clients -> host containers

Some lower-level scenarios can talk directly to host transport endpoints when that makes the assertion clearer, but the smoke suite should use devshardctl.

`postgres`¶

Postgres is part of the smoke environment and should be the default storage backend for CI smoke tests. SQLite remains useful for local restart tests and single-host persistence edge cases.

Storage scenarios should cover:

SQLite host restart
Postgres host restart
all-host restart
session version conflict
session epoch conflict where applicable

Test Binaries¶

The E2E suite needs runnable commands that are small wrappers around existing devshard packages.

Recommended commands:

devshard/cmd/devshardd/
  main.go

devshard/cmd/mock-chain/
  main.go

`devshardd`¶

devshardd runs one host participant.

For the first E2E implementation, devshardd should be an E2E-only command. It should not be treated as a production binary yet. This keeps the first iteration focused on integration validation, while leaving room to harden and promote the command later if it becomes the right production shape.

It should wire:

bridge client
state machine
host
transport server
storage
gossip peers
inference engine
validation engine
readiness endpoint
dev-only control endpoint when explicitly enabled

For E2E, devshardd can start with stub inference and validation engines. The important point is that the protocol runtime itself is real.

`mock-chain`¶

mock-chain serves local metadata and deterministic control behavior. It should start as a simple HTTP server matching the current REST bridge shape. If devshard later moves to a different chain client protocol, the mock should follow that boundary.

Fault Injection¶

Deterministic fault injection should be part of the test design from the beginning. Without it, timeout and recovery tests become slow and flaky.

The first control surface should support:

fail next inference
delay next inference
hang next inference until cancelled
withhold executor receipt
return a corrupt response hash
return invalid validation result
pause gossip
resume gossip
reject bridge metadata requests
return stale bridge metadata
advance mock epoch
change approved versions

Fault controls must be disabled unless the process is started in explicit test mode.

Scenario Set¶

Smoke Scenarios¶

Smoke scenarios should be reliable and fast enough for every CI run.

Happy path

Start three hosts and devshardctl. Send several non-streaming chat completion requests. Finalize the session. Assert the settlement output is present and all hosts agree on the final state.

Streaming path

Send a streaming chat completion request through devshardctl. Assert the client receives content chunks and [DONE]. Assert devshard protocol receipt/meta events are handled internally and do not corrupt the OpenAI-compatible stream.

Auth rejection

Send a protected host request signed by an unauthorized key. Assert the request is rejected with an authorization error.

Protocol Scenarios¶

Gossip convergence

Submit work while all hosts are running. Assert nonce, mempool, and signature data propagate between participants and converge.

Host catch-up

Let one host miss earlier diffs, then send it a later request with catch-up diffs. Assert it reaches the same state root as the rest of the group.

Executor failure and timeout

Configure the selected executor to fail or hang. Assert timeout votes are collected, the timeout transaction is applied, and the session can continue or finalize according to protocol rules.

Receipt challenge

Withhold or lose the executor response path, then challenge the executor for a receipt. Assert the receipt is valid and the user session can process it.

Recovery Scenarios¶

SQLite host restart

Run several inferences, restart one host container with its SQLite volume preserved, continue the session, and finalize. Assert there is no nonce regression and the restarted host signs the final state.

Postgres recovery

Run the happy path with Postgres storage enabled. Restart all hosts and continue the session. Assert state recovery from Postgres works and finalization succeeds.

All-host restart before finalization

Run several inferences, stop every host, restart them, then finalize. Assert persisted diffs and signatures are sufficient to recover.

Version And Routing Scenarios¶

Legacy route prefix

Run a session through /v1/devshard/* and assert the stored session version is v1.

Versioned route prefix

Run a session through /devshard/<version>/* and assert the stored session version is the selected version.

Version conflict

Create or recover the same escrow under one version, then attempt to attach the same escrow under a different version. Assert storage rejects the conflict.

Chain Metadata Scenarios¶

Warm key authorization

Configure a warm key grant in mock-chain. Assert the warm key can authenticate where allowed and is rejected after the grant is removed or when used for the wrong participant.

Bridge metadata failure

Inject a bridge metadata error during session creation or recovery. Assert the host fails its readiness check or returns the expected service-unavailable response.

Assertions¶

E2E tests should avoid asserting only HTTP status codes. Useful protocol-level assertions include:

expected OpenAI-compatible response shape
expected SSE stream shape
monotonic nonce progression
expected inference status transitions
matching final state root across hosts
expected signatures by slot
settlement payload includes final nonce, state, version, and signatures
storage metadata pins escrow to the expected epoch and version
restarted hosts recover latest known state
unauthorized signers are rejected
fault scenarios produce the expected protocol transaction
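As an example of a protocol-level check from this list, the nonce assertion can be expressed over a series of nonces sampled from a host during a scenario, including across a restart. This sketch assumes that repeated samples may legitimately show the same nonce, so only a decrease counts as a regression; `firstRegression` is a hypothetical helper.

```go
package main

import "fmt"

// firstRegression returns the index of the first sample that is lower than
// its predecessor, or -1 if the series never decreases.
func firstRegression(nonces []uint64) int {
	for i := 1; i < len(nonces); i++ {
		if nonces[i] < nonces[i-1] {
			return i
		}
	}
	return -1
}

func main() {
	healthy := []uint64{1, 2, 2, 5, 6}   // repeated sample at nonce 2 is fine
	regressed := []uint64{1, 2, 5, 4, 6} // e.g. a host restarted from stale state
	fmt.Println(firstRegression(healthy), firstRegression(regressed)) // prints: -1 3
}
```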

Settlement Contract¶

Until the E2E suite submits settlement to a live chain, the stable settlement contract should be the protocol commitment needed for chain-side verification.

Baseline settlement assertions should cover:

escrow ID
session version
final nonce
final state root or final state commitment
terminal session phase
terminal state for every included inference
threshold-sufficient signatures
each signature verifies over the final state commitment
each signature maps to a valid slot in the session group
duplicate slot signatures are not counted twice

Economic fields such as token accounting, fees, remaining balance, host costs, missed counts, and validation penalties should be asserted only in dedicated accounting scenarios. They should not be part of the baseline smoke settlement contract until the chain submission path is part of the E2E suite.

CI Tiers¶

Use focused go test runs rather than one large undifferentiated suite. CI should build the required Docker images through explicit make targets before running the E2E suite. The Go tests should select already-built images rather than building images per test run.

Example targets:

make devshard-e2e-images
go test ./devshard/e2e -run TestE2E_Smoke -count=1
go test ./devshard/e2e -run TestE2E_Protocol -count=1
go test ./devshard/e2e -run TestE2E_Storage -count=1

devshard-e2e-images should be an explicit build target that produces the images used by the tests, including mock-chain, devshard-host, and devshardctl. The E2E tests should fail fast if those images are missing instead of silently rebuilding them inside individual test cases.
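The fail-fast behavior can be sketched as a preflight check that runs once before any scenario. The image tags below are illustrative assumptions (the proposal names the images but not their tags), and the `exists` callback is injected so the sketch runs without Docker; a real check might query the Docker API or shell out to `docker image inspect`.

```go
package main

import "fmt"

// missingImages returns the required images that the exists check does not find.
func missingImages(required []string, exists func(image string) bool) []string {
	var missing []string
	for _, img := range required {
		if !exists(img) {
			missing = append(missing, img)
		}
	}
	return missing
}

func main() {
	// Hypothetical tags; the proposal only names the three images.
	required := []string{"mock-chain:e2e", "devshard-host:e2e", "devshardctl:e2e"}

	// Simulated local image store in which devshard-host has not been built.
	built := map[string]bool{"mock-chain:e2e": true, "devshardctl:e2e": true}

	missing := missingImages(required, func(img string) bool { return built[img] })
	if len(missing) > 0 {
		fmt.Printf("missing E2E images %v; run `make devshard-e2e-images` first\n", missing)
		return
	}
	fmt.Println("all E2E images present")
}
```

In a Go test package, a check like this would naturally sit in `TestMain`, so every tier aborts with one clear message instead of each test failing on container start.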

Recommended tiers:

| Tier | Purpose | Typical scenarios |
| --- | --- | --- |
| Smoke | Fast CI confidence with Postgres enabled | happy path, streaming, auth rejection |
| Protocol | Main protocol coverage | gossip, catch-up, timeout, receipt challenge |
| Storage | Deeper persistence coverage | SQLite restart, Postgres restart, all-host restart |
| Versioning | Route/version safety | legacy route, versioned route, version conflict |
| Fault | Slower failure coverage | delayed hosts, bad hashes, bridge faults |

💬 Comments (1)¶

Comment 1 · @a-kuprin¶

2026-06-12 13:36 UTC

@aikuznetsov Please take a look at this: https://github.com/a-kuprin/gonka/blob/1f0933ad9136cfbcf7070f8210e2c6694731ebaf/devshard/docs/proposals/TESTENV_PROPOSAL.md

It uses multiple devshardd containers, one devshardctl, one dapi-mock, and one mock-chain, all running in Docker, and it doesn't use a chain.

It has even already been used to test the new height-sync protocol for devshard: https://github.com/gonka-ai/gonka/pull/1209

The difference is that decentralized-api is actually used (but mocked for the protocol). decentralized-api is the MLServer (serving nodes) and also the oracle for parameters and height.

I also had some thoughts on higher-level scripting over the test environment for creating test plans: https://github.com/a-kuprin/gonka/blob/devshard-testenv/devshard/docs/proposals/PROTOCOL_TESTING_PROPOSAL.md
