nym: race the exit-liveness probe across stable targets + rounds
The single-target, single-shot probe (1.1.1.1:443) was the read tunnel's
liveness gate AND its keepalive. On a slow or unlucky exit->1.1.1.1 path it
falsely declared healthy tunnels DEAD. As the gate, that rejected freshly built
tunnels and forced reselection forever ("Connecting to Nym" for minutes, or
never connecting); as the keepalive, it condemned working tunnels. The probe
now races two anycast targets (1.1.1.1:443, 9.9.9.9:443) over up to two rounds,
mirroring the DoT/DoH resolver racing above it: a tunnel is DEAD only if it
reaches NEITHER target in BOTH rounds. Verified over 15 cold starts: 15/15
connected, zero DEAD-EXIT, zero hangs (the single-target probe was the sole
cause of the connects that never resolved).
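
The race-then-retry rule described above can be illustrated with a small, std-only sketch. It is not the code from this commit: the real probe is async and uses `FuturesUnordered` with `tokio::time::timeout`, while this version uses threads and a channel so it compiles without external crates. The names `race_probe` and `connect` are hypothetical; `connect` stands in for a blocking equivalent of `tunnel.tcp_connect`.

```rust
use std::net::SocketAddr;
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: races `connect` against every target at once, for up
/// to `rounds` rounds, and reports whether any attempt succeeded.
/// `connect` is an assumed stand-in for dialing a target through the tunnel.
pub fn race_probe<F>(
    targets: &[SocketAddr],
    rounds: usize,
    per_target: Duration,
    connect: F,
) -> bool
where
    F: Fn(SocketAddr) -> bool + Send + Sync + 'static,
{
    let connect = Arc::new(connect);
    for _round in 0..rounds {
        let (tx, rx) = mpsc::channel();
        for &addr in targets {
            let tx = tx.clone();
            let connect = Arc::clone(&connect);
            // One thread per target so a slow path cannot delay a fast one.
            thread::spawn(move || {
                // The receiver may already be gone if another target won.
                let _ = tx.send(connect(addr));
            });
        }
        // Drop our sender so the channel disconnects once every attempt reports.
        drop(tx);

        let deadline = Instant::now() + per_target;
        loop {
            let left = deadline.saturating_duration_since(Instant::now());
            match rx.recv_timeout(left) {
                Ok(true) => return true, // first target reached wins the race
                Ok(false) => continue,   // this target missed; wait for the rest
                Err(_) => break,         // timed out, or all attempts missed
            }
        }
        // Unlike dropped futures, attempts still running here are not
        // cancelled; their threads simply finish in the background.
    }
    // Dead only if no target was reached in any round.
    false
}
```

Calling `race_probe(&targets, 2, Duration::from_secs(8), connect)` would mirror the commit's two rounds and 8-second per-target wait. Under those values a sketch like this returns `false` after roughly 16 seconds at most, because the targets within a round wait concurrently rather than in series.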
+46
-22
@@ -464,32 +464,56 @@ fn cache_hit(host: &str) -> Option<CacheHit> {
     })
 }
 
-/// Address the liveness probe dials THROUGH the tunnel: Cloudflare's anycast
-/// resolver on 443. Any reachable public IP works; 443 is chosen because it is
-/// never firewalled by an exit policy (relays + HTTPS already ride it).
-const PROBE_ADDR: SocketAddr = SocketAddr::new(IpAddr::V4(Ipv4Addr::new(1, 1, 1, 1)), 443);
-/// The probe must complete within this; a mixnet TCP handshake is a few seconds.
+/// Stable public addresses the liveness probe RACES through the tunnel: a tunnel
+/// is alive if it can reach ANY of them. Racing (not one fixed target) is why a
+/// momentarily slow path to a single resolver no longer false-declares a healthy
+/// exit DEAD — the same reason the DoT/DoH resolvers above are raced, not tried in
+/// series. Both are anycast resolvers on :443 (never exit-policy-firewalled, since
+/// relays + HTTPS already ride it) and effectively always-on.
+const PROBE_ADDRS: [SocketAddr; 2] = [
+    SocketAddr::new(IpAddr::V4(Ipv4Addr::new(1, 1, 1, 1)), 443),
+    SocketAddr::new(IpAddr::V4(Ipv4Addr::new(9, 9, 9, 9)), 443),
+];
+/// Per-target connect wait; a mixnet TCP handshake is a few seconds.
 const PROBE_TIMEOUT: Duration = Duration::from_secs(8);
+/// Probe rounds before a tunnel is declared dead. A single lost mixnet packet
+/// mid-handshake should not condemn a whole tunnel, so an all-miss round is
+/// retried once (mirrors the DoT/DoH round loop). Only a tunnel that reaches
+/// NEITHER stable target across BOTH rounds is DEAD — this is what stops a
+/// healthy-but-unlucky tunnel from being thrown away and reselected forever.
+const PROBE_ROUNDS: usize = 2;
 
-/// End-to-end exit-liveness probe: open a TCP connection THROUGH the tunnel to a
-/// stable public address and immediately drop it. Because TCP over the mixnet
-/// RETRANSMITS, a single lost datagram does not spuriously fail a healthy exit —
-/// unlike the old UDP DNS probe, whose lost datagrams falsely declared good
-/// exits DEAD and drove reselects. Proves the full path (mixnet → IPR exit →
-/// internet) and keeps the gateway/IPR session from idling out. Used by the
-/// fresh-tunnel gate and the watchdog keepalive.
+/// End-to-end exit-liveness probe: try to open a TCP connection THROUGH the tunnel
+/// to any of a few stable public addresses (raced, retried a round) and drop the
+/// winner immediately. Because TCP over the mixnet RETRANSMITS, a single lost
+/// datagram does not spuriously fail a healthy exit; racing several targets over
+/// two rounds additionally absorbs a momentarily slow single path — together they
+/// stop the false-DEAD reselect churn the old single-target probe caused. Proves
+/// the full path (mixnet → IPR exit → internet) and keeps the gateway/IPR session
+/// from idling out. Used by the fresh-tunnel gate and the watchdog keepalive.
 pub async fn probe(tunnel: &Tunnel) -> bool {
-    match tokio::time::timeout(PROBE_TIMEOUT, tunnel.tcp_connect(PROBE_ADDR)).await {
-        Ok(Ok(_stream)) => true,
-        Ok(Err(e)) => {
-            debug!("probe: tcp_connect to {PROBE_ADDR} through tunnel failed: {e}");
+    for round in 0..PROBE_ROUNDS {
+        let mut inflight = FuturesUnordered::new();
+        for addr in PROBE_ADDRS {
+            inflight.push(async move {
+                matches!(
+                    tokio::time::timeout(PROBE_TIMEOUT, tunnel.tcp_connect(addr)).await,
+                    Ok(Ok(_))
+                )
+            });
+        }
+        while let Some(reached) = inflight.next().await {
+            if reached {
+                return true;
+            }
+        }
+        debug!(
+            "probe: no stable target reachable through tunnel (round {}/{PROBE_ROUNDS})",
+            round + 1
+        );
+    }
+    debug!("probe: tunnel failed liveness — reached no stable target in {PROBE_ROUNDS} rounds");
     false
-        }
-        Err(_) => {
-            debug!("probe: tcp_connect to {PROBE_ADDR} through tunnel timed out");
-            false
-        }
-    }
 }
 
 /// Encode a recursive A query for `host` with transaction id `id`.