nym: fast fresh-gate probe budget — dead exits condemned in 10s, not ~32s
Measured (30 cold trials at the gateway-race commit): the worst cold-start
outliers (51s, 72s) were dominated by DEAD-EXIT condemnation cost, not the
gateway. A fresh tunnel on a blackholing exit burned probe_fresh's doubled
patient budget (2 calls x 2 rounds x 8s ~= 32s) before reselecting.

Split the liveness probe into two budgets over a shared probe_with_budget:
- probe() (established tunnels: watchdog keepalive + condemnation checks):
  UNCHANGED - 8s x 2 rounds, worst 16s, same patience as before.
- probe_fresh() (gating a just-built tunnel before publish): 5s x 2 rounds x
  1 call = 10s worst. 5s is 4.2x the measured worst healthy probe (successful
  probes: min 465ms / median 774ms / max 1197ms across 15 trials), so the
  build130 false-condemn regression stays far away.

DEAD-EXIT declaration drops from ~36s measured to ~13-17s total; the warn now
logs the isolated probe_ms.

Also adds the gateway-race observability the verification flagged (race START /
WON-by-draw + ms / loser reaped-connected vs dropped-pending /
survivor-after-error), so the race is visible in [timing] logs instead of
inferred statistically.

Verify on the next emulator pass: DEAD EXIT probe_ms <= ~10s, zero false DEAD
EXIT across ~15 cold trials, established keepalive/condemn cadence unchanged.
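The budget figures in the message follow from plain multiplication, because the targets inside one round are raced and so do not add to the wait. A minimal sketch of that arithmetic; the `worst_case` and `headroom` helpers are hypothetical and not part of the commit:

```rust
use std::time::Duration;

/// Hypothetical helper: worst-case wait for a probe invoked `calls` times,
/// each invocation running `rounds` rounds bounded by `per_target`.
/// Targets within a round are raced, so they do not multiply the wait.
fn worst_case(calls: u32, rounds: u32, per_target: Duration) -> Duration {
    per_target * calls * rounds
}

/// Hypothetical helper: how many times the slowest healthy probe fits
/// into the per-target timeout.
fn headroom(timeout: Duration, worst_healthy: Duration) -> f64 {
    timeout.as_secs_f64() / worst_healthy.as_secs_f64()
}

fn main() {
    // Old fresh gate: 2 calls of the patient probe (2 rounds x 8s each).
    let old_fresh = worst_case(2, 2, Duration::from_secs(8));
    // Established-tunnel probe: unchanged, 1 call x 2 rounds x 8s.
    let established = worst_case(1, 2, Duration::from_secs(8));
    // New fresh gate: 1 call x 2 rounds x 5s.
    let new_fresh = worst_case(1, 2, Duration::from_secs(5));

    println!("old fresh gate:  {old_fresh:?}");
    println!("established:     {established:?}");
    println!("new fresh gate:  {new_fresh:?}");
    println!(
        "headroom over slowest healthy probe: {:.1}x",
        headroom(Duration::from_secs(5), Duration::from_millis(1197))
    );
}
```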
+64
-17
@@ -474,30 +474,77 @@ const PROBE_ADDRS: [SocketAddr; 2] = [
     SocketAddr::new(IpAddr::V4(Ipv4Addr::new(1, 1, 1, 1)), 443),
     SocketAddr::new(IpAddr::V4(Ipv4Addr::new(9, 9, 9, 9)), 443),
 ];
-/// Per-target connect wait; a mixnet TCP handshake is a few seconds.
+/// Per-target connect wait for the PATIENT probe of an ESTABLISHED tunnel
+/// (watchdog keepalive + condemnation). A mixnet TCP handshake is a few seconds,
+/// and an exit already in service must NEVER be thrown away over a momentary load
+/// spike, so this stays deliberately generous at 8s — the pre-existing budget.
+/// (The just-built-tunnel GATE uses the tighter [`FRESH_PROBE_TIMEOUT`]; the two
+/// budgets are asymmetric on purpose — see [`probe_fresh`].)
 const PROBE_TIMEOUT: Duration = Duration::from_secs(8);
-/// Probe rounds before a tunnel is declared dead. A single lost mixnet packet
-/// mid-handshake should not condemn a whole tunnel, so an all-miss round is
-/// retried once (mirrors the DoT/DoH round loop). Only a tunnel that reaches
-/// NEITHER stable target across BOTH rounds is DEAD — this is what stops a
+/// Probe rounds before an ESTABLISHED tunnel is declared dead. A single lost
+/// mixnet packet mid-handshake should not condemn a whole tunnel, so an all-miss
+/// round is retried once (mirrors the DoT/DoH round loop). Only a tunnel that
+/// reaches NEITHER stable target across BOTH rounds is DEAD — this is what stops a
 /// healthy-but-unlucky tunnel from being thrown away and reselected forever.
 const PROBE_ROUNDS: usize = 2;

-/// End-to-end exit-liveness probe: try to open a TCP connection THROUGH the tunnel
-/// to any of a few stable public addresses (raced, retried a round) and drop the
-/// winner immediately. Because TCP over the mixnet RETRANSMITS, a single lost
-/// datagram does not spuriously fail a healthy exit; racing several targets over
-/// two rounds additionally absorbs a momentarily slow single path — together they
-/// stop the false-DEAD reselect churn the old single-target probe caused. Proves
-/// the full path (mixnet → IPR exit → internet) and keeps the gateway/IPR session
-/// from idling out. Used by the fresh-tunnel gate and the watchdog keepalive.
+/// Per-target connect wait for the FAST GATE of a FRESH, just-built tunnel (before
+/// it is published). Tighter than the established [`PROBE_TIMEOUT`] because a
+/// healthy fresh probe connects FAST: across 15 cold-start trials the SUCCESSFUL
+/// exit probe completed in 465–1197ms (median 774ms), so 5s is >4x the measured
+/// worst case — ample headroom to never false-condemn a slow-but-healthy fresh
+/// exit (the build130 single-shot regression we must not reintroduce). The point
+/// of the asymmetry: a genuinely DEAD fresh exit (accepts the IPR handshake but
+/// delivers nothing) is now condemned in ~10s instead of the ~32s the doubled
+/// patient probe cost on this path, which dominated the cold-start latency tail.
+const FRESH_PROBE_TIMEOUT: Duration = Duration::from_secs(5);
+/// Probe rounds for the fresh-tunnel gate. SAME 2-round retry as the established
+/// path: a single lost mixnet datagram mid-handshake still gets a second chance
+/// before the tunnel is condemned — the transient-loss protection the original
+/// trigger-happy single-shot probe lacked. Worst-case fresh-gate budget is
+/// therefore FRESH_PROBE_ROUNDS × FRESH_PROBE_TIMEOUT = 10s (vs the old ~32s).
+const FRESH_PROBE_ROUNDS: usize = 2;
+
+/// PATIENT end-to-end liveness probe of an ESTABLISHED tunnel, on the generous
+/// [`PROBE_TIMEOUT`]/[`PROBE_ROUNDS`] budget (worst case ~16s). Used by the
+/// watchdog keepalive and the condemnation exit-DNS check — an exit already in
+/// service must never be false-condemned over a momentary hiccup. The FRESH,
+/// just-built-tunnel gate uses [`probe_fresh`] instead (a tighter budget). See
+/// [`probe_with_budget`] for the shared mechanics.
 pub async fn probe(tunnel: &Tunnel) -> bool {
-    for round in 0..PROBE_ROUNDS {
+    probe_with_budget(tunnel, PROBE_TIMEOUT, PROBE_ROUNDS).await
+}
+
+/// FAST end-to-end liveness GATE for a FRESH, just-built tunnel, run BEFORE it is
+/// published, on the tighter [`FRESH_PROBE_TIMEOUT`]/[`FRESH_PROBE_ROUNDS`] budget
+/// (worst case ~10s vs the ~32s the doubled patient probe cost on this path). A
+/// fresh exit that accepts the IPR handshake yet delivers nothing (a DEAD EXIT) is
+/// condemned quickly instead of dominating the cold-start tail — WITHOUT
+/// reintroducing the false-condemn of a healthy exit (build130): the 5s per-target
+/// timeout is >4x the measured worst-case healthy fresh probe (1197ms) and the
+/// 2-round retry still absorbs a single lost datagram. See [`probe_with_budget`].
+pub async fn probe_fresh(tunnel: &Tunnel) -> bool {
+    probe_with_budget(tunnel, FRESH_PROBE_TIMEOUT, FRESH_PROBE_ROUNDS).await
+}
+
+/// Shared raced-targets liveness probe on an explicit per-target `timeout` /
+/// `rounds` budget: try to open a TCP connection THROUGH the tunnel to any of a few
+/// stable public addresses (raced, retried a round) and drop the winner
+/// immediately. Because TCP over the mixnet RETRANSMITS, a single lost datagram
+/// does not spuriously fail a healthy exit; racing several targets over multiple
+/// rounds additionally absorbs a momentarily slow single path — together they stop
+/// the false-DEAD reselect churn the old single-target probe caused. Proves the
+/// full path (mixnet → IPR exit → internet) and keeps the gateway/IPR session from
+/// idling out. Callers pick the budget: [`probe`] (patient, established tunnels)
+/// vs [`probe_fresh`] (fast, fresh-tunnel gate) — the racing + multi-round
+/// structure is identical, only the timeout/rounds differ.
+async fn probe_with_budget(tunnel: &Tunnel, timeout: Duration, rounds: usize) -> bool {
+    for round in 0..rounds {
         let mut inflight = FuturesUnordered::new();
         for addr in PROBE_ADDRS {
             inflight.push(async move {
                 matches!(
-                    tokio::time::timeout(PROBE_TIMEOUT, tunnel.tcp_connect(addr)).await,
+                    tokio::time::timeout(timeout, tunnel.tcp_connect(addr)).await,
                     Ok(Ok(_))
                 )
             });
@@ -508,11 +555,11 @@ pub async fn probe(tunnel: &Tunnel) -> bool {
             }
         }
         debug!(
-            "probe: no stable target reachable through tunnel (round {}/{PROBE_ROUNDS})",
+            "probe: no stable target reachable through tunnel (round {}/{rounds})",
             round + 1
         );
     }
-    debug!("probe: tunnel failed liveness — reached no stable target in {PROBE_ROUNDS} rounds");
+    debug!("probe: tunnel failed liveness — reached no stable target in {rounds} rounds");
     false
 }

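The shape of `probe_with_budget` (race all targets inside one timeout, retry the whole round, succeed on the first connect) can be sketched without the tunnel or tokio. The version below is an illustration under stated assumptions only: it uses blocking threads and a channel instead of `FuturesUnordered`, and the `Connect` function pointer is a hypothetical stand-in for `tunnel.tcp_connect(addr)`.

```rust
use std::net::SocketAddr;
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `tunnel.tcp_connect(addr)`: blocks, then
/// reports whether the connect succeeded.
type Connect = fn(SocketAddr) -> bool;

/// Race every target within `timeout`, for up to `rounds` rounds.
/// Returns true as soon as any target connects in any round.
fn probe_with_budget(
    targets: &[SocketAddr],
    timeout: Duration,
    rounds: usize,
    connect: Connect,
) -> bool {
    for _round in 0..rounds {
        let (tx, rx) = mpsc::channel();
        for &addr in targets {
            let tx = tx.clone();
            // One thread per target; a result that arrives late is ignored.
            thread::spawn(move || {
                let _ = tx.send(connect(addr));
            });
        }
        drop(tx); // `rx` disconnects once every attempt has reported
        let deadline = Instant::now() + timeout;
        loop {
            let left = deadline.saturating_duration_since(Instant::now());
            match rx.recv_timeout(left) {
                Ok(true) => return true, // first success wins
                Ok(false) => continue,   // that target failed; wait for the rest
                Err(_) => break,         // round timed out, or every target failed
            }
        }
    }
    false
}

/// Simulated healthy exit: connects immediately.
fn healthy(_addr: SocketAddr) -> bool {
    true
}

/// Simulated dead exit: the connect hangs well past the probe timeout.
fn blackhole(_addr: SocketAddr) -> bool {
    thread::sleep(Duration::from_secs(2));
    false
}

fn main() {
    let targets: Vec<SocketAddr> = vec![
        "1.1.1.1:443".parse().unwrap(),
        "9.9.9.9:443".parse().unwrap(),
    ];
    let started = Instant::now();
    let alive = probe_with_budget(&targets, Duration::from_millis(50), 2, blackhole);
    println!("blackhole exit alive={alive} after {:?}", started.elapsed());
    let alive = probe_with_budget(&targets, Duration::from_millis(50), 2, healthy);
    println!("healthy exit alive={alive}");
}
```

With a blackholing exit the call returns after roughly `rounds` times `timeout`, which is the property the fresh-gate budget relies on. Unlike the async original, the hung threads here are not cancelled; they are simply left to finish.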
+41
-24
@@ -45,7 +45,7 @@ use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
 use std::thread;
 use std::time::{Duration, Instant};

-use log::{error, info, warn};
+use log::{debug, error, info, warn};
 use parking_lot::RwLock;
 use smolmix::{Recipient, Tunnel};

@@ -332,11 +332,18 @@ fn run_tunnel() {
                     // publishing such a tunnel would blackhole every consumer
                     // until the watchdog caught it minutes later. Re-select
                     // immediately instead. (This is a CHEAP early signal; relay
-                    // reachability below is the authoritative one.)
-                    if !probe_fresh(&tunnel).await {
+                    // reachability below is the authoritative one.) Uses the FAST
+                    // fresh-gate budget (~10s worst case) — NOT the patient
+                    // established-tunnel probe (~32s doubled here before) — so a
+                    // dead fresh exit no longer dominates the cold-start tail; see
+                    // `dns::probe_fresh`.
+                    let probe_started = Instant::now();
+                    let alive = super::dns::probe_fresh(&tunnel).await;
+                    let probe_ms = probe_started.elapsed().as_millis();
+                    if !alive {
                         warn!(
-                            "[timing] nym: DEAD EXIT — fresh {} tunnel failed liveness probe after {}ms \
-                             (attempt {attempt}); {}",
+                            "[timing] nym: DEAD EXIT — fresh {} tunnel failed liveness probe in {probe_ms}ms \
+                             ({}ms total incl. build; attempt {attempt}); {}",
                             choice.label(),
                             started.elapsed().as_millis(),
                             match choice {
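The hunk above starts a second clock around the probe so that the warning can report the probe time separately from the total elapsed time, which also includes the tunnel build. A small sketch of that pattern; the `timed` helper and the sleeps are hypothetical stand-ins, not code from the commit:

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: run `step` and return its result together with
/// the milliseconds it took, measured in isolation.
fn timed<T>(step: impl FnOnce() -> T) -> (T, u128) {
    let started = Instant::now();
    let out = step();
    (out, started.elapsed().as_millis())
}

fn main() {
    let started = Instant::now();
    thread::sleep(Duration::from_millis(30)); // stand-in for the tunnel build

    let (alive, probe_ms) = timed(|| {
        thread::sleep(Duration::from_millis(20)); // stand-in for the probe
        false
    });

    if !alive {
        println!(
            "DEAD EXIT: probe failed in {probe_ms}ms ({}ms total incl. build)",
            started.elapsed().as_millis()
        );
    }
}
```

Keeping the two clocks apart is what lets the verification step check `probe_ms` against the ~10s budget without the build time blurring the number.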
@@ -503,18 +510,6 @@ fn run_tunnel() {
     });
 }

-/// Two attempts of the (TCP, retransmitting) liveness probe before rejecting a
-/// fresh tunnel — one transient hiccup while the exit settles must not condemn
-/// an otherwise healthy exit.
-async fn probe_fresh(tunnel: &smolmix::Tunnel) -> bool {
-    for _ in 0..2 {
-        if super::dns::probe(tunnel).await {
-            return true;
-        }
-    }
-    false
-}
-
 /// Exit-liveness keepalive period and the consecutive probe failures that
 /// declare death (the probe is now a TCP connect through the tunnel, not UDP DNS).
 const KEEPALIVE_PERIOD: Duration = Duration::from_secs(60);
@@ -912,14 +907,16 @@ async fn connect_gateway_racing(

     // Spawn both so the loser can be aborted cleanly. `cfg` is `Copy`, so each task
     // gets the identical anonymity config.
+    let race_started = Instant::now();
     let mut a = tokio::spawn(connect_one(cfg));
     let mut b = tokio::spawn(connect_one(cfg));
+    debug!("[timing] nym: gateway race START — 2 ephemeral draws, first up wins");

     // Whichever finishes first; keep `other` to reap (on a win) or fall back to (if
-    // the first draw errored).
-    let (first, other) = tokio::select! {
-        r = &mut a => (r, b),
-        r = &mut b => (r, a),
+    // the first draw errored). `winner` tags WHICH draw finished first.
+    let (first, other, winner) = tokio::select! {
+        r = &mut a => (r, b, 'A'),
+        r = &mut b => (r, a, 'B'),
     };
     // A JoinError (task panic) folds into an error so `other` still gets its turn.
     let first = first.unwrap_or_else(|e| {
@@ -932,12 +929,26 @@ async fn connect_gateway_racing(
     match first {
         // First to finish connected — it WINS. Reap the loser off the hot path.
         Ok(client) => {
+            info!(
+                "[timing] nym: gateway race WON by draw {winner} in {}ms; reaping loser off the hot path",
+                race_started.elapsed().as_millis()
+            );
             other.abort();
             tokio::spawn(async move {
                 // If the loser connected before the abort landed, disconnect it so
                 // no live gateway session leaks; a pending connect was just dropped.
-                if let Ok(Ok(loser)) = other.await {
-                    loser.disconnect().await;
+                match other.await {
+                    Ok(Ok(loser)) => {
+                        debug!(
+                            "[timing] nym: gateway race loser had connected before abort — \
+                             disconnecting so no gateway session leaks"
+                        );
+                        loser.disconnect().await;
+                    }
+                    _ => debug!(
+                        "[timing] nym: gateway race loser still pending at reap — dropped \
+                         (no session to close)"
+                    ),
                 }
             });
             Ok(client)
@@ -945,7 +956,13 @@ async fn connect_gateway_racing(
         // First draw failed — a lone client has no dead-draw tail, so just await the
         // survivor; if it fails too, surface an error and `run_tunnel` re-selects.
         Err(first_err) => match other.await {
-            Ok(Ok(client)) => Ok(client),
+            Ok(Ok(client)) => {
+                info!(
+                    "[timing] nym: gateway race — draw {winner} errored, survivor connected in {}ms",
+                    race_started.elapsed().as_millis()
+                );
+                Ok(client)
+            }
             Ok(Err(second_err)) => {
                 warn!(
                     "[timing] nym: both raced gateway connects failed \
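The race logic these hunks instrument (two draws, first to finish wins if it connected, otherwise fall back to the survivor) can be sketched with plain threads. This is an illustration, not the commit's code: `Draw`, `race` and the three sample draws are hypothetical, and threads cannot be aborted the way a tokio `JoinHandle` can, so the loser here simply runs to completion and its result is discarded. The returned tag names the draw that connected, whereas `winner` in the diff names the draw that finished first.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for one gateway connect attempt.
type Draw = fn() -> Result<u32, String>;

/// Race two draws. The first to finish wins if it connected; if it
/// errored, wait for the survivor. Returns the tag of the draw that
/// connected together with its value.
fn race(a: Draw, b: Draw) -> Result<(char, u32), String> {
    let (tx, rx) = mpsc::channel();
    for (tag, draw) in [('A', a), ('B', b)] {
        let tx = tx.clone();
        thread::spawn(move || {
            let _ = tx.send((tag, draw()));
        });
    }
    drop(tx);

    let (first_tag, first) = rx.recv().map_err(|e| e.to_string())?;
    match first {
        // First to finish connected: it wins, the loser is ignored.
        Ok(value) => Ok((first_tag, value)),
        // First to finish errored: the survivor still gets its turn.
        Err(first_err) => match rx.recv() {
            Ok((tag, Ok(value))) => Ok((tag, value)),
            Ok((_, Err(second_err))) => {
                Err(format!("both draws failed: {first_err}; {second_err}"))
            }
            Err(_) => Err(format!("survivor vanished after: {first_err}")),
        },
    }
}

fn fast_ok() -> Result<u32, String> {
    Ok(1)
}

fn slow_ok() -> Result<u32, String> {
    thread::sleep(Duration::from_millis(200));
    Ok(2)
}

fn fast_err() -> Result<u32, String> {
    Err("refused".to_string())
}

fn main() {
    println!("{:?}", race(slow_ok, fast_ok));
    println!("{:?}", race(fast_err, slow_ok));
    println!("{:?}", race(fast_err, fast_err));
}
```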