forked from GRIN/grim

nym: fast fresh-gate probe budget — dead exits condemned in 10s, not ~32s

Measured (30 cold trials at the gateway-race commit): the worst cold-start
outliers (51s, 72s) were dominated by DEAD-EXIT condemnation cost, not the
gateway. A fresh tunnel on a blackholing exit burned probe_fresh's doubled
patient budget (2 calls x 2 rounds x 8s ~= 32s) before reselecting.

Split the liveness probe into two budgets over a shared probe_with_budget:
- probe() (established tunnels: watchdog keepalive + condemnation checks):
  UNCHANGED - 8s x 2 rounds, worst 16s, same patience as before.
- probe_fresh() (gating a just-built tunnel before publish): 5s x 2 rounds
  x 1 call = 10s worst. 5s is 4.2x the measured worst healthy probe
  (successful probes: min 465ms / median 774ms / max 1197ms across 15
  trials), so the build130 false-condemn regression stays far away.
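
The budget arithmetic above can be checked with a small sketch. The numbers are the ones quoted in this message; `worst_case` and `headroom` are hypothetical helper names that do not exist in the codebase, and the model assumes every round runs to its full timeout with the targets inside a round raced in parallel.

```rust
use std::time::Duration;

/// Worst-case wall time of a probe: every round times out, the targets inside
/// a round are raced in parallel (so one round costs one timeout), and the
/// whole probe is invoked `calls` times back to back.
fn worst_case(timeout: Duration, rounds: u32, calls: u32) -> Duration {
    timeout * rounds * calls
}

/// How many times larger the per-target timeout is than the slowest healthy
/// probe that was measured.
fn headroom(timeout: Duration, worst_healthy: Duration) -> f64 {
    timeout.as_secs_f64() / worst_healthy.as_secs_f64()
}
```

With these helpers, the old fresh path is `worst_case(8s, 2, 2)`, the established path is `worst_case(8s, 2, 1)`, and the new fresh gate is `worst_case(5s, 2, 1)`; `headroom(5s, 1197ms)` is the ratio this message rounds to 4.2x.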

DEAD-EXIT declaration drops from ~36s measured to ~13-17s total; the warn
now logs the isolated probe_ms. Also adds the gateway-race observability the
verification flagged (race START / WON-by-draw + ms / loser reaped-connected
vs dropped-pending / survivor-after-error), so the race is visible in
[timing] logs instead of inferred statistically.
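
As a rough illustration of what "WON-by-draw + ms" means, here is a minimal std-only analogue of the race. It is not the commit's code, which races two tokio tasks with `tokio::select!`; this sketch swaps in OS threads and a channel, and `race` plus the sleep-based "connect" are hypothetical stand-ins.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Race two simulated draws and report which one finished first and how long
/// the race took. Each delay stands in for one gateway connect attempt.
fn race(delay_a: Duration, delay_b: Duration) -> (char, Duration) {
    let (tx, rx) = mpsc::channel();
    let started = Instant::now();
    for (tag, delay) in [('A', delay_a), ('B', delay_b)] {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(delay); // stand-in for the connect attempt
            // The loser may send after the receiver is gone; ignore that error.
            let _ = tx.send(tag);
        });
    }
    // The first message to arrive identifies the winning draw.
    let winner = rx.recv().expect("at least one draw reports");
    (winner, started.elapsed())
}
```

Tagging each draw and timing from a single `Instant` taken before the spawns is what lets a log line name the winner and its latency directly instead of inferring it from aggregate statistics.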

Verify on the next emulator pass: DEAD EXIT probe_ms <= ~10s, zero false
DEAD EXIT across ~15 cold trials, established keepalive/condemn cadence
unchanged.
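
The first check could be automated along these lines. This is a sketch under assumptions: it keys only on the `failed liveness probe in <N>ms` fragment of the new warn line, `parse_probe_ms` and `dead_exits_within_budget` are hypothetical names, and the slack that stands in for "~10s" is a free parameter rather than a value taken from this commit.

```rust
/// Extract the isolated probe time from a DEAD EXIT warn line by locating the
/// "failed liveness probe in <N>ms" fragment. Returns None if it is absent.
fn parse_probe_ms(line: &str) -> Option<u64> {
    const MARKER: &str = "failed liveness probe in ";
    let start = line.find(MARKER)? + MARKER.len();
    let rest = &line[start..];
    // Length of the leading run of ASCII digits.
    let digits_len = rest
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(rest.len());
    if !rest[digits_len..].starts_with("ms") {
        return None;
    }
    rest[..digits_len].parse().ok()
}

/// True when every DEAD EXIT line in the log stays within budget plus slack.
/// Lines without the fragment are ignored.
fn dead_exits_within_budget(log: &str, budget_ms: u64, slack_ms: u64) -> bool {
    log.lines()
        .filter_map(parse_probe_ms)
        .all(|ms| ms <= budget_ms + slack_ms)
}
```

Feeding the emulator log through `dead_exits_within_budget(log, 10_000, slack)` gives a pass/fail answer for the probe_ms criterion; the false-condemn and cadence checks still need the trial outcomes themselves.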
2ro
2026-07-03 15:55:37 -04:00
parent ce23214d98
commit c78d7b0e60
2 changed files with 105 additions and 41 deletions
+64 -17
@@ -474,30 +474,77 @@ const PROBE_ADDRS: [SocketAddr; 2] = [
     SocketAddr::new(IpAddr::V4(Ipv4Addr::new(1, 1, 1, 1)), 443),
     SocketAddr::new(IpAddr::V4(Ipv4Addr::new(9, 9, 9, 9)), 443),
 ];
-/// Per-target connect wait; a mixnet TCP handshake is a few seconds.
+/// Per-target connect wait for the PATIENT probe of an ESTABLISHED tunnel
+/// (watchdog keepalive + condemnation). A mixnet TCP handshake is a few seconds,
+/// and an exit already in service must NEVER be thrown away over a momentary load
+/// spike, so this stays deliberately generous at 8s — the pre-existing budget.
+/// (The just-built-tunnel GATE uses the tighter [`FRESH_PROBE_TIMEOUT`]; the two
+/// budgets are asymmetric on purpose — see [`probe_fresh`].)
 const PROBE_TIMEOUT: Duration = Duration::from_secs(8);
-/// Probe rounds before a tunnel is declared dead. A single lost mixnet packet
-/// mid-handshake should not condemn a whole tunnel, so an all-miss round is
-/// retried once (mirrors the DoT/DoH round loop). Only a tunnel that reaches
-/// NEITHER stable target across BOTH rounds is DEAD — this is what stops a
+/// Probe rounds before an ESTABLISHED tunnel is declared dead. A single lost
+/// mixnet packet mid-handshake should not condemn a whole tunnel, so an all-miss
+/// round is retried once (mirrors the DoT/DoH round loop). Only a tunnel that
+/// reaches NEITHER stable target across BOTH rounds is DEAD — this is what stops a
 /// healthy-but-unlucky tunnel from being thrown away and reselected forever.
 const PROBE_ROUNDS: usize = 2;
-/// End-to-end exit-liveness probe: try to open a TCP connection THROUGH the tunnel
-/// to any of a few stable public addresses (raced, retried a round) and drop the
-/// winner immediately. Because TCP over the mixnet RETRANSMITS, a single lost
-/// datagram does not spuriously fail a healthy exit; racing several targets over
-/// two rounds additionally absorbs a momentarily slow single path — together they
-/// stop the false-DEAD reselect churn the old single-target probe caused. Proves
-/// the full path (mixnet → IPR exit → internet) and keeps the gateway/IPR session
-/// from idling out. Used by the fresh-tunnel gate and the watchdog keepalive.
+/// Per-target connect wait for the FAST GATE of a FRESH, just-built tunnel (before
+/// it is published). Tighter than the established [`PROBE_TIMEOUT`] because a
+/// healthy fresh probe connects FAST: across 15 cold-start trials the SUCCESSFUL
+/// exit probe completed in 465–1197ms (median 774ms), so 5s is >4x the measured
+/// worst case — ample headroom to never false-condemn a slow-but-healthy fresh
+/// exit (the build130 single-shot regression we must not reintroduce). The point
+/// of the asymmetry: a genuinely DEAD fresh exit (accepts the IPR handshake but
+/// delivers nothing) is now condemned in ~10s instead of the ~32s the doubled
+/// patient probe cost on this path, which dominated the cold-start latency tail.
+const FRESH_PROBE_TIMEOUT: Duration = Duration::from_secs(5);
+/// Probe rounds for the fresh-tunnel gate. SAME 2-round retry as the established
+/// path: a single lost mixnet datagram mid-handshake still gets a second chance
+/// before the tunnel is condemned — the transient-loss protection the original
+/// trigger-happy single-shot probe lacked. Worst-case fresh-gate budget is
+/// therefore FRESH_PROBE_ROUNDS × FRESH_PROBE_TIMEOUT = 10s (vs the old ~32s).
+const FRESH_PROBE_ROUNDS: usize = 2;
+/// PATIENT end-to-end liveness probe of an ESTABLISHED tunnel, on the generous
+/// [`PROBE_TIMEOUT`]/[`PROBE_ROUNDS`] budget (worst case ~16s). Used by the
+/// watchdog keepalive and the condemnation exit-DNS check — an exit already in
+/// service must never be false-condemned over a momentary hiccup. The FRESH,
+/// just-built-tunnel gate uses [`probe_fresh`] instead (a tighter budget). See
+/// [`probe_with_budget`] for the shared mechanics.
 pub async fn probe(tunnel: &Tunnel) -> bool {
-    for round in 0..PROBE_ROUNDS {
+    probe_with_budget(tunnel, PROBE_TIMEOUT, PROBE_ROUNDS).await
+}
+/// FAST end-to-end liveness GATE for a FRESH, just-built tunnel, run BEFORE it is
+/// published, on the tighter [`FRESH_PROBE_TIMEOUT`]/[`FRESH_PROBE_ROUNDS`] budget
+/// (worst case ~10s vs the ~32s the doubled patient probe cost on this path). A
+/// fresh exit that accepts the IPR handshake yet delivers nothing (a DEAD EXIT) is
+/// condemned quickly instead of dominating the cold-start tail — WITHOUT
+/// reintroducing the false-condemn of a healthy exit (build130): the 5s per-target
+/// timeout is >4x the measured worst-case healthy fresh probe (1197ms) and the
+/// 2-round retry still absorbs a single lost datagram. See [`probe_with_budget`].
+pub async fn probe_fresh(tunnel: &Tunnel) -> bool {
+    probe_with_budget(tunnel, FRESH_PROBE_TIMEOUT, FRESH_PROBE_ROUNDS).await
+}
+/// Shared raced-targets liveness probe on an explicit per-target `timeout` /
+/// `rounds` budget: try to open a TCP connection THROUGH the tunnel to any of a few
+/// stable public addresses (raced, retried a round) and drop the winner
+/// immediately. Because TCP over the mixnet RETRANSMITS, a single lost datagram
+/// does not spuriously fail a healthy exit; racing several targets over multiple
+/// rounds additionally absorbs a momentarily slow single path — together they stop
+/// the false-DEAD reselect churn the old single-target probe caused. Proves the
+/// full path (mixnet → IPR exit → internet) and keeps the gateway/IPR session from
+/// idling out. Callers pick the budget: [`probe`] (patient, established tunnels)
+/// vs [`probe_fresh`] (fast, fresh-tunnel gate) — the racing + multi-round
+/// structure is identical, only the timeout/rounds differ.
+async fn probe_with_budget(tunnel: &Tunnel, timeout: Duration, rounds: usize) -> bool {
+    for round in 0..rounds {
         let mut inflight = FuturesUnordered::new();
         for addr in PROBE_ADDRS {
             inflight.push(async move {
                 matches!(
-                    tokio::time::timeout(PROBE_TIMEOUT, tunnel.tcp_connect(addr)).await,
+                    tokio::time::timeout(timeout, tunnel.tcp_connect(addr)).await,
                     Ok(Ok(_))
                 )
             });
@@ -508,11 +555,11 @@ pub async fn probe(tunnel: &Tunnel) -> bool {
             }
         }
         debug!(
-            "probe: no stable target reachable through tunnel (round {}/{PROBE_ROUNDS})",
+            "probe: no stable target reachable through tunnel (round {}/{rounds})",
             round + 1
         );
     }
-    debug!("probe: tunnel failed liveness — reached no stable target in {PROBE_ROUNDS} rounds");
+    debug!("probe: tunnel failed liveness — reached no stable target in {rounds} rounds");
     false
 }
+41 -24
@@ -45,7 +45,7 @@ use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
 use std::thread;
 use std::time::{Duration, Instant};
-use log::{error, info, warn};
+use log::{debug, error, info, warn};
 use parking_lot::RwLock;
 use smolmix::{Recipient, Tunnel};
@@ -332,11 +332,18 @@ fn run_tunnel() {
 // publishing such a tunnel would blackhole every consumer
 // until the watchdog caught it minutes later. Re-select
 // immediately instead. (This is a CHEAP early signal; relay
-// reachability below is the authoritative one.)
-if !probe_fresh(&tunnel).await {
+// reachability below is the authoritative one.) Uses the FAST
+// fresh-gate budget (~10s worst case) — NOT the patient
+// established-tunnel probe (~32s doubled here before) — so a
+// dead fresh exit no longer dominates the cold-start tail; see
+// `dns::probe_fresh`.
+let probe_started = Instant::now();
+let alive = super::dns::probe_fresh(&tunnel).await;
+let probe_ms = probe_started.elapsed().as_millis();
+if !alive {
     warn!(
-        "[timing] nym: DEAD EXIT — fresh {} tunnel failed liveness probe after {}ms \
-         (attempt {attempt}); {}",
+        "[timing] nym: DEAD EXIT — fresh {} tunnel failed liveness probe in {probe_ms}ms \
+         ({}ms total incl. build; attempt {attempt}); {}",
         choice.label(),
         started.elapsed().as_millis(),
         match choice {
@@ -503,18 +510,6 @@ fn run_tunnel() {
     });
 }
-/// Two attempts of the (TCP, retransmitting) liveness probe before rejecting a
-/// fresh tunnel — one transient hiccup while the exit settles must not condemn
-/// an otherwise healthy exit.
-async fn probe_fresh(tunnel: &smolmix::Tunnel) -> bool {
-    for _ in 0..2 {
-        if super::dns::probe(tunnel).await {
-            return true;
-        }
-    }
-    false
-}
 /// Exit-liveness keepalive period and the consecutive probe failures that
 /// declare death (the probe is now a TCP connect through the tunnel, not UDP DNS).
 const KEEPALIVE_PERIOD: Duration = Duration::from_secs(60);
@@ -912,14 +907,16 @@ async fn connect_gateway_racing(
     // Spawn both so the loser can be aborted cleanly. `cfg` is `Copy`, so each task
     // gets the identical anonymity config.
+    let race_started = Instant::now();
     let mut a = tokio::spawn(connect_one(cfg));
     let mut b = tokio::spawn(connect_one(cfg));
+    debug!("[timing] nym: gateway race START — 2 ephemeral draws, first up wins");
     // Whichever finishes first; keep `other` to reap (on a win) or fall back to (if
-    // the first draw errored).
-    let (first, other) = tokio::select! {
-        r = &mut a => (r, b),
-        r = &mut b => (r, a),
+    // the first draw errored). `winner` tags WHICH draw finished first.
+    let (first, other, winner) = tokio::select! {
+        r = &mut a => (r, b, 'A'),
+        r = &mut b => (r, a, 'B'),
     };
     // A JoinError (task panic) folds into an error so `other` still gets its turn.
     let first = first.unwrap_or_else(|e| {
@@ -932,12 +929,26 @@ async fn connect_gateway_racing(
     match first {
         // First to finish connected — it WINS. Reap the loser off the hot path.
         Ok(client) => {
+            info!(
+                "[timing] nym: gateway race WON by draw {winner} in {}ms; reaping loser off the hot path",
+                race_started.elapsed().as_millis()
+            );
             other.abort();
             tokio::spawn(async move {
                 // If the loser connected before the abort landed, disconnect it so
                 // no live gateway session leaks; a pending connect was just dropped.
-                if let Ok(Ok(loser)) = other.await {
-                    loser.disconnect().await;
+                match other.await {
+                    Ok(Ok(loser)) => {
+                        debug!(
+                            "[timing] nym: gateway race loser had connected before abort — \
+                             disconnecting so no gateway session leaks"
+                        );
+                        loser.disconnect().await;
+                    }
+                    _ => debug!(
+                        "[timing] nym: gateway race loser still pending at reap — dropped \
+                         (no session to close)"
+                    ),
                 }
             });
             Ok(client)
@@ -945,7 +956,13 @@ async fn connect_gateway_racing(
         // First draw failed — a lone client has no dead-draw tail, so just await the
         // survivor; if it fails too, surface an error and `run_tunnel` re-selects.
         Err(first_err) => match other.await {
-            Ok(Ok(client)) => Ok(client),
+            Ok(Ok(client)) => {
+                info!(
+                    "[timing] nym: gateway race — draw {winner} errored, survivor connected in {}ms",
+                    race_started.elapsed().as_millis()
+                );
+                Ok(client)
+            }
             Ok(Err(second_err)) => {
                 warn!(
                     "[timing] nym: both raced gateway connects failed \