Drew: first long form summary, claude produced from notes

2026-06-29T14:58:22Z

= Silent network hang on Raspberry Pi CM5 / Pi 5 (macb / RP1 ethernet) =

''A technical writeup of a silent ethernet TX stall on the Raspberry Pi Compute
Module 5, what it looks like, how to tell it apart from superficially similar bugs,
and which mitigations did and did not work in our testing. Written for others hitting
the same failure. Claims here are limited to what we directly observed or can cite;
inference and open questions are marked as such.''

== Summary ==

On Raspberry Pi 5 / CM5 hardware (Cadence GEM MAC, <code>macb</code> driver, RP1
southbridge), the on-board gigabit ethernet can '''stop transmitting while the link
layer continues to report the link as up'''. The receive path keeps working, the
driver logs nothing, all <code>ethtool</code> error counters stay at zero, and the
kernel's own TX watchdog never fires. The host remains fully alive locally; it is
simply unreachable on the network until the NIC is reset (a link down/up usually
suffices; a reboot always does).

We observed this on '''three CM5 nodes''' running '''Ubuntu 26.04''' with the
'''<code>linux-raspi</code> 7.0.0-1011''' kernel. Based on a captured counter
time-series (below), we attribute it to the '''silent TX-ring stall''' described and
fixed upstream in the Raspberry Pi Foundation kernel
([https://github.com/raspberrypi/linux/issues/7339 raspberrypi/linux #7339] /
[https://github.com/raspberrypi/linux/pull/7340 PR #7340]). That fix is '''not yet in
Ubuntu's <code>linux-raspi</code>''' ([https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877
Launchpad #2133877], still "Confirmed" as of 2026-06-29).

== Environment where we observed it ==

{| class="wikitable"
! Item !! Value
|-
| Board || Raspberry Pi Compute Module 5 (CM5), arm64
|-
| OS / kernel || Ubuntu 26.04 LTS, <code>linux-raspi</code> <code>7.0.0-1011-raspi</code>
|-
| NIC (from <code>dmesg</code>) || <code>macb 1f00100000.ethernet eth0: Cadence GEM rev 0x00070109</code>, PHY <code>Broadcom BCM54213PE</code>
|-
| Topology || On Pi 5 / CM5 the ethernet MAC is reached over PCIe via the '''RP1''' southbridge (unlike Pi 4, where the MAC is on the SoC). The RP1↔SoC DMA path is relevant to the root cause.
|-
| Root filesystem || NVMe (not SD). Relevant only because it makes the driver swap below low-risk.
|}

We did '''not''' observe the failure on an x86-64 node with a different NIC in the
same cluster, consistent with this being specific to the <code>macb</code>/RP1 path
rather than anything in the surrounding software.

== Symptom and signature ==

The defining property is that '''carrier never drops'''. A cable, switch, or PHY
fault produces a <code>Link is Down</code> event; this failure does not.

Observed, every time:

* <code>ip link</code> shows the interface <code>LOWER_UP</code>; the last carrier event in the log is the <code>Link is Up - 1Gbps/Full</code> from boot. No carrier loss is ever logged.
* All off-host traffic stops '''simultaneously''' — gateway, peers, NFS, DNS. ARP for the gateway goes <code>INCOMPLETE</code>.
* The <code>macb</code> driver emits '''nothing''' — no TX-timeout, no DMA/IRQ error. <code>ethtool -S eth0</code> shows '''zero''' on every error and drop counter.
* The kernel netdev TX watchdog (<code>dev_watchdog</code>) '''does not fire''', because <code>trans_start</code> keeps being updated — this is why the stall is "silent."
* The host is otherwise healthy: the kernel is responsive on the local console, journald keeps writing, no panic, no OOM, no thermal event.
* No self-recovery. The interface stays in this state indefinitely until reset.

If you see '''link-up + all-egress-dead + no carrier event + driver silent + zero
error counters''', you are very likely looking at this bug rather than a cable, PHY,
congestion, or buffer-exhaustion problem.

=== Why it is easy to miss, and why we caught it quickly ===

This failure '''self-erases on reboot''': the only evidence is in the previous boot's
log, and that log shows nothing wrong with the link. On a desktop or a stateless
worker it tends to read as a one-off "the machine dropped off the network," and a
reboot makes it disappear. (Our first encounter, on an uninstrumented node, was
exactly this — unreachable, no captured logs, reboot fixed it, cause unknown at the
time.)

It became unmissable because the affected hosts were '''etcd voting members''' of a
k3s control plane. When the NIC stalls on such a node, the local etcd is partitioned,
the local API server can no longer serve quorum reads, the controller-manager loses
its lease, and the k3s process exits and is restarted — a loud, logged, repeating
failure rather than a silent drop. That is an artifact of our test setup, not of the
bug, but it is a useful detail: '''if you run any quorum service (etcd, Consul, etc.)
on CM5 hardware, this bug will surface as a control-plane meltdown, not as a quiet
link blip.''' The underlying event is still just one stalled TX ring.

== Diagnostic evidence we collected ==

Because none of the standard mechanisms detect this (carrier stays up, so
<code>systemd-networkd</code> sees nothing; <code>dev_watchdog</code> never fires; the
hardware watchdog keeps being petted because the host is alive), we ran a
'''reachability''' probe that, on N consecutive failures, captured a read-only
forensic snapshot ''before'' resetting the interface. The snapshots are what let us
classify the failure rather than guess at it.
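
A read-only snapshot of the kind described above might be captured along these lines. This is a minimal sketch, not the script we ran: the function names <code>run_logged</code> and <code>capture_snapshot</code>, the output file naming, and the exact command list are illustrative assumptions. Every command only reads state, and a failing or missing command is recorded rather than aborting the capture.

<syntaxhighlight lang="bash">
# Sketch of a read-only forensic snapshot. Names and paths are assumptions.

# Run one command, record its output under a header, never abort on failure.
run_logged() {
    echo "== $* =="
    "$@" 2>&1 || echo "(command failed or unavailable: exit $?)"
    echo
}

# capture_snapshot IFACE OUTDIR -> prints the path of the snapshot file.
# OUTDIR should be on local disk: the network is what has failed.
capture_snapshot() {
    local iface=$1 outdir=$2 out i
    mkdir -p "$outdir" || return 1
    out="$outdir/snap-$(date -u +%Y%m%dT%H%M%SZ)-$iface.txt"
    {
        run_logged date -u
        run_logged ip -s link show dev "$iface"   # carrier flags, kernel counters
        run_logged ip neigh show dev "$iface"     # neighbour (ARP) table state
        run_logged ethtool "$iface"               # speed/duplex, "Link detected"
        # Three samples about 1 s apart, so TX and RX movement can be compared.
        for i in 1 2 3; do
            run_logged ethtool -S "$iface"
            run_logged grep -E "$iface|CPU" /proc/interrupts
            sleep 1
        done
        run_logged cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
    } > "$out"
    echo "$out"
}
</syntaxhighlight>

Calling <code>capture_snapshot eth0 /var/log/net-snapshots</code> (a hypothetical directory) before any reset preserves the counters that the reset would otherwise disturb.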

What the snapshots showed, consistently:

* '''Link up, 1000/Full, <code>Link detected: yes</code>''' at the moment of the hang.
* '''Every error and drop counter at 0''' (<code>rx_resource_errors</code>, <code>rx_overruns</code>, FCS errors, <code>tx_carrier_sense_errors</code>, <code>q0_{tx,rx}_dropped</code>). Rules out cable, congestion, buffer exhaustion, and PHY faults.
* '''All four EEE/LPI counters (<code>{rx,tx}_lpi_{transitions,time}</code>) at 0''' — the PHY had not entered Low Power Idle.
* '''No RCU stalls''' in the kernel log preceding any hang.
* '''A single TX queue''' (only <code>q0_*</code> counters exist) with the eth0 IRQ servicing entirely on CPU0.

The decisive capture was a '''3-sample, ~1-second-apart time-series''' of the tx/rx
frame counters taken at the moment of one hang:

{| class="wikitable"
! Sample !! <code>tx_frames</code> !! <code>rx_frames</code>
|-
| t0 || 23553102 || 26633386
|-
| t1 || 23553102 || 26633400
|-
| t2 || 23553102 || 26633423
|}

'''<code>tx_frames</code> is frozen across all three samples while <code>rx_frames</code>
continues to climb''', and the eth0 IRQ count keeps advancing. This is a direct
observation of the TX path being stalled while RX and interrupts are still live — not
an inference from symptoms. It is this datapoint, more than the symptom description,
that drives the root-cause attribution below.
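
The same judgement can be made mechanically. The sketch below is illustrative rather than the tooling we used: <code>read_counters</code> and <code>classify_samples</code> are hypothetical helper names, and the verdict labels are our own. The counter names <code>tx_frames</code> and <code>rx_frames</code> are the ones in the table above.

<syntaxhighlight lang="bash">
# read_counters IFACE -> prints "TX RX" (tx_frames and rx_frames) on one line.
read_counters() {
    ethtool -S "$1" | awk '
        $1 == "tx_frames:" { tx = $2 }
        $1 == "rx_frames:" { rx = $2 }
        END { print tx, rx }'
}

# classify_samples reads "TX RX" lines on stdin and prints one verdict:
#   tx-stall      TX never moved while RX advanced (the signature above)
#   tx-moving     TX advanced at least once, so not this stall
#   idle-or-dead  neither counter moved; not conclusive on its own
classify_samples() {
    awk '
        NR == 1 { tx0 = $1 + 0; rx0 = $2 + 0 }
        { if ($1 + 0 != tx0) tx_moved = 1; rx_last = $2 + 0 }
        END {
            if (NR < 2)              print "need-more-samples"
            else if (tx_moved)       print "tx-moving"
            else if (rx_last > rx0)  print "tx-stall"
            else                     print "idle-or-dead"
        }'
}

# Live use (before resetting the interface):
#   for i in 1 2 3; do read_counters eth0; sleep 1; done | classify_samples
</syntaxhighlight>

Fed the three rows from the table, <code>classify_samples</code> prints <code>tx-stall</code>. An idle host with no traffic in either direction yields <code>idle-or-dead</code>, which is why the check is only meaningful while something is trying to transmit.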

== Root-cause analysis: which bug is this? ==

There are several public reports that look alike at the symptom level. They are '''not
all the same bug''', and getting the mechanism right determines which mitigation is
worth applying. This section lays out the candidates and the evidence for our
attribution.

=== The candidate reports ===

{| class="wikitable"
! Report !! Proposed mechanism !! Same silicon?
|-
| '''Ubuntu Launchpad #2133877''' — "Complete network hang on Raspberry Pi 5 … possibly related to CPU frequency scaling" || CPU '''frequency-scaling transitions''' corrupting the RP1/macb DMA path; the report notes RCU stalls as a precursor and that the <code>performance</code> governor stopped those RCU stalls. || '''Yes''' (Pi 5, <code>linux-raspi</code>)
|-
| '''raspberrypi/linux #7339 / PR #7340''' — "candidate fixes for silent TX stall on BCM2712/RP1" || A '''silent TX stall in the <code>macb</code> driver''': tx stops advancing, RX keeps working, link stays up. Three driver patches. || '''Yes''' (BCM2712/RP1, <code>macb</code>)
|-
| '''siderolabs/sbc-raspberrypi #91''' — "silent network death on Talos" || Decomposes the failure into an '''EEE LPI-wake race''', a '''macb TSO/GSO TX-ring hang''' (single TX queue, softirqs on CPU0, small rings), and a Talos-specific socket issue. Prevention bundle: EEE off + TSO/GSO off + larger rings. || '''Yes''' (Pi 5, BCM2712 + Cadence GEM + BCM54213PE)
|-
| Various "<code>ethtool -K eth0 tso off gso off</code> fixes it" posts || Trace originally to Intel '''e1000e''' "Detected Hardware Unit Hang" — a '''different NIC and driver'''. || '''No'''
|}

=== What our evidence supports, and what it argues against ===

'''The symptom matches #2133877 essentially line-for-line''' — same NIC, complete
network hang, host alive locally, clean <code>ethtool</code> stats, no <code>macb</code>
log lines, recovers only on reset. So at the ''symptom'' level we are clearly in the
#2133877 family.

'''But #2133877's proposed mechanism (frequency scaling) does not fit our
observations,''' on three independent points:

# '''No RCU stalls.''' #2133877's headline precursor is RCU stalls (the reporter saw them at a steady rate, and pinning the governor stopped them). We logged '''zero''' RCU stalls before any hang, across every event. The precursor that report hangs its causal story on is absent on our hardware.
# '''The hang occurred at maximum, pinned frequency.''' We trialed the <code>performance</code> governor (which holds the cores at a fixed maximum frequency and therefore eliminates frequency ''transitions'') on the worst-affected node. It '''hung again ~30 minutes later''', and our counter capture from a separate hang shows the cores already at the maximum 2400 MHz at the time of the stall. If frequency transitions were necessary to trigger the hang, pinning the frequency should have prevented it. It did not. (This is a single negative trial, n=1, but it is a direct one.)
# '''The captured time-series matches #7340's description exactly.''' #7340 describes the failure as "tx_packets stops incrementing, RX still works, link stays up." Our capture shows precisely that — frozen <code>tx_frames</code>, climbing <code>rx_frames</code>, live IRQ. #2133877 does not characterize the failure at the ring-counter level; #7340 does, and our data fits it.

'''Our interpretation:''' #2133877 and #7339 are most likely the '''same underlying
defect''', and the "frequency scaling" wording in #2133877's title is a hypothesis
from before the TX-ring stall was isolated, not an established mechanism. The
root-cause work that produced an actual fix is #7340, whose described signature is the
one we captured. We treat the frequency-scaling angle as '''not supported on our
hardware''': we cannot rule out a frequency ''transition'' acting as one possible
trigger, but our evidence shows transitions are '''not necessary''' (the hang happens
at pinned max frequency) and that the RCU-stall precursor central to that report is
'''absent''' for us.

=== On the EEE and TSO/GSO theories (siderolabs #91) ===

siderolabs #91 is the closest same-silicon analysis and is the source of the popular
prevention bundle (EEE off, TSO/GSO off, larger rings). Two of its three triggers are
worth separating against our data:

* '''EEE / LPI-wake race''' — '''not our mechanism.''' All four LPI counters read zero at every hang; the PHY never entered Low Power Idle. Disabling EEE addresses a state our hardware was never in. (On our setup EEE was advertised but inactive because the switch did not negotiate it, so disabling it also had no measurable power cost — but that is incidental.)
* '''macb TSO/GSO TX-ring hang''' — this is the trigger that '''does''' line up structurally with what we see: a single TX queue, all NET softirqs on CPU0, and the silent-because-<code>trans_start</code>-still-ticks behavior. This is the same family as #7340. However — see the mitigations section — disabling TSO/GSO and enlarging the rings '''did not prevent the hang on our kernel''', so while the structural description fits, the offload-disable ''remedy'' did not hold for us.
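
To check the EEE point on your own hardware, the LPI counters can be tested directly. A minimal sketch, assuming the four counter names listed earlier appear in <code>ethtool -S</code> output on your kernel; <code>lpi_counters_zero</code> is a hypothetical helper name.

<syntaxhighlight lang="bash">
# lpi_counters_zero reads `ethtool -S` output on stdin.
# Exit status 0 only if all four LPI counters are present and zero.
lpi_counters_zero() {
    awk '
        $1 ~ /^(rx|tx)_lpi_(transitions|time):$/ {
            seen++
            if ($2 + 0 != 0) nonzero++
        }
        END { exit !(seen == 4 && nonzero == 0) }'
}

# Usage:
#   ethtool -S eth0 | lpi_counters_zero && echo "PHY never entered LPI"
#   ethtool --show-eee eth0    # shows whether EEE is enabled and active
</syntaxhighlight>

A non-zero exit means either that a counter is non-zero (the PHY did enter Low Power Idle, so the EEE theory stays open for you) or that the counters are missing from the output.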

'''The e1000e-derived "just turn off TSO" advice does not transfer''' — it originates
from a different NIC and driver, and a lone TSO-off was tried and walked back in the
#7340 discussion. We mention it only because it is widely repeated.

=== What we are NOT claiming ===

* We have not bisected the kernel or independently proven the PCIe-posted-write mechanism that PR #7340's patches address; we are relying on the upstream commits for the mechanism and on our counter capture for the match.
* We have not (yet) demonstrated that the backported fix '''eliminates''' the hang — that trial is in progress and its result is pending (see Status).
* We cannot speak to whether other Pi 5 / CM5 configurations (different PHY link speeds, RPiOS vs Ubuntu, SD vs NVMe) behave identically.

== The upstream fix ==

[https://github.com/raspberrypi/linux/pull/7340 raspberrypi/linux PR #7340] was merged
into the Raspberry Pi Foundation kernel branch <code>rpi-6.18.y</code> on 2026-05-08,
and the series subsequently went to mainline netdev (v2, 2026-05-14). It is three
patches to the Cadence/<code>macb</code> driver:

{| class="wikitable"
! # !! Upstream commit !! Change (per the commit) !! Role
|-
| 1 || <code>63d230184da3</code> || Flush the PCIe posted write after the TSTART doorbell, so the doorbell that tells the MAC to begin transmission is not left sitting in the PCIe fabric. || Identified as the core fix
|-
| 2 || <code>3ccf780ce058</code> || Re-check the interrupt status register after re-enabling interrupts in <code>macb_tx_poll</code>. || Closes a lost-interrupt race
|-
| 3 || <code>8ea87c96f1a6</code> || Add a TX-stall watchdog inside the driver. || Defence-in-depth
|}

The mechanism implied by patch 1 — a '''posted MMIO write that can be reordered/delayed
in the PCIe fabric so the MAC never sees the "start TX" doorbell''' — is consistent
with everything we observed: TX stops, RX and interrupts continue, no error is raised,
and the link is unaffected. We present this as the upstream-identified mechanism, not
as something we independently verified at the silicon level.

'''Caveats from the upstream discussion''' (worth knowing if you deploy the fix):

* Upstream labels them '''"candidate fixes."''' Contributors reported that a software watchdog could still occasionally trigger even with the patches applied, and that "both patches + watchdog are necessary to keep things stable."
* Recovery experience varied: an <code>ip link</code> down/up recovered the interface for some reporters but '''not all'''; at least one needed a driver bind/unbind. This matches our finding that a link bounce usually but not always clears it.

== State of the fix in Ubuntu ==

As of '''2026-06-29''', the fix is '''not in Ubuntu's <code>linux-raspi</code>'''. Our
nodes are on the newest available build (<code>7.0.0-1011</code>) and its changelog
contains no trace of the patches. Launchpad bug
[https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 #2133877] remains
in state '''"Confirmed"''' (it has not advanced to Fix Committed / Fix Released). A
Canonical kernel engineer on that bug indicated a 7.0-based <code>linux-raspi</code>
would "take some time" and gave no committed date. So the realistic horizon for an
Ubuntu kernel that fixes this at the source is '''indeterminate (weeks to months)'''.

To check whether the fix has landed without external web access (the package archive
may be reachable even when Launchpad is not):

<syntaxhighlight lang="bash">
apt update && apt-get changelog linux-image-raspi | grep -iE 'TX stall|2133877|TSTART'
</syntaxhighlight>

An empty result means it has not shipped; a match means the SRU is in and you can fix
this by simply updating the kernel.

== Mitigations and their measured results ==

We tried four things. Two are worth the effort: the watchdog, which is the only
''confirmed'' recovery, and the backported driver, which is the only plausible ''cure''
(result pending). We report what each one actually did, including the ones that failed.

=== 1. Reachability watchdog (recovery) — works; this is the load-bearing mitigation ===

A small periodic probe that pings '''two''' independent off-host targets (we use the
gateway and a NAS). Requiring '''both''' to fail in a cycle, and requiring '''N
consecutive''' failed cycles, avoids false positives from ordinary blips. On trip it:

# Captures a read-only forensic snapshot (the data that made this writeup possible).
# Resets the interface: <code>ip link set eth0 down && ip link set eth0 up</code>. This re-initialises the MAC and, in our experience, usually clears the stall without a reboot.
# If still dead after the bounce, '''reboots''' as a last resort.

Notes that matter:

* It must test '''reachability''', not link/carrier state — carrier stays up, so a link-state monitor will never fire.
* The recovery script and its logs must live on '''local disk''', not on any network filesystem — the network is exactly what is gone when it runs.
* '''Timing has to beat whatever your stack's failure threshold is.''' In our case a slow bounce let the node stay partitioned long enough for the k3s control plane to give up on etcd and restart. We tightened detection (5 s probe cadence, bounce after 2 failed cycles, so roughly 10–15 s to first reset) while keeping the reboot a patient last resort (~75 s). If you run quorum services, tune the bounce to land inside their lease/election timeout.
* A reboot is an acceptable recovery for a node that is one member of a fault-tolerant quorum; it is more disruptive for a singleton. The bounce-first ordering exists to avoid rebooting when a link reset would do.

The kernel's <code>dev_watchdog</code> and the on-board hardware watchdog '''cannot'''
substitute for this — the former never fires (the stall is silent by definition) and
the latter is happy because the host itself is not hung.
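
The probe logic described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not our production script: the target addresses are documentation placeholders from 192.0.2.0/24, and the variable names, log directory, and cycle thresholds are hypothetical. The thresholds approximate the timings above; the reboot threshold in particular is counted in cycles, so with ping timeouts included it lands somewhat later than 75 s.

<syntaxhighlight lang="bash">
# Reachability watchdog sketch. All names, addresses, and thresholds are
# illustrative assumptions.
IFACE="${IFACE:-eth0}"
TARGET_A="${TARGET_A:-192.0.2.1}"       # placeholder: gateway
TARGET_B="${TARGET_B:-192.0.2.10}"      # placeholder: second host (e.g. NAS)
PROBE_INTERVAL="${PROBE_INTERVAL:-5}"   # seconds between cycles
BOUNCE_AFTER="${BOUNCE_AFTER:-2}"       # consecutive failed cycles -> bounce
REBOOT_AFTER="${REBOOT_AFTER:-15}"      # consecutive failed cycles -> reboot
LOG_DIR="${LOG_DIR:-/var/log/net-watchdog}"  # local disk, never a network mount

probe() { ping -c 1 -W 1 "$1" >/dev/null 2>&1; }

# A cycle counts as failed only when BOTH targets are unreachable.
cycle_failed() { ! probe "$TARGET_A" && ! probe "$TARGET_B"; }

# decide FAILS -> prints wait | bounce | reboot for a consecutive-failure count.
decide() {
    local fails=$1
    if [ "$fails" -ge "$REBOOT_AFTER" ]; then echo reboot
    elif [ "$fails" -eq "$BOUNCE_AFTER" ]; then echo bounce
    else echo wait
    fi
}

bounce() {
    mkdir -p "$LOG_DIR"
    # Snapshot first: the reset disturbs the evidence.
    { date -u; ip -s link show dev "$IFACE"; ethtool -S "$IFACE"; } \
        > "$LOG_DIR/hang-$(date -u +%Y%m%dT%H%M%SZ).txt" 2>&1
    ip link set "$IFACE" down && ip link set "$IFACE" up
}

watch_loop() {
    local fails=0
    while sleep "$PROBE_INTERVAL"; do
        if cycle_failed; then fails=$((fails + 1)); else fails=0; fi
        case "$(decide "$fails")" in
            bounce) bounce ;;
            reboot) systemctl reboot ;;
        esac
    done
}

# Run the loop only when asked to, so the file can be sourced and tested.
if [ "${1:-}" = "--run" ]; then watch_loop; fi
</syntaxhighlight>

Keeping <code>decide</code> as a pure function of the failure count makes the escalation policy easy to test without a network. The bounce fires exactly once, on the cycle where the count reaches <code>BOUNCE_AFTER</code>, and any successful cycle resets the count.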

=== 2. <code>performance</code> CPU governor — did NOT prevent it ===

Tested specifically to evaluate the frequency-scaling hypothesis. Pinning the cores to
maximum frequency eliminates frequency transitions. The node '''hung again ~30 minutes
after pinning'''. We reverted it. Besides being ineffective here, it carries a small
but continuous idle power cost (we measured roughly +0.5–0.75 W on a fanless ~3 W-class
node). '''Recommendation: do not rely on this''', and treat it as evidence against the
frequency-transition mechanism rather than a mitigation.
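
For anyone repeating this trial, the governor can be pinned and verified through the standard cpufreq sysfs files. A minimal sketch: <code>governor_report</code> and <code>pin_performance</code> are hypothetical helper names, and the optional root argument exists only so the functions can be exercised against a test directory.

<syntaxhighlight lang="bash">
# governor_report [CPU_ROOT] -> one line per CPU: "cpuN governor cur_khz max_khz"
governor_report() {
    local root=${1:-/sys/devices/system/cpu} d
    for d in "$root"/cpu[0-9]*/cpufreq; do
        [ -r "$d/scaling_governor" ] || continue
        printf '%s %s %s %s\n' \
            "$(basename "$(dirname "$d")")" \
            "$(cat "$d/scaling_governor")" \
            "$(cat "$d/scaling_cur_freq")" \
            "$(cat "$d/cpuinfo_max_freq")"
    done
}

# pin_performance [CPU_ROOT]: select the performance governor on every CPU.
# Needs root on a real system; not persistent across reboots.
pin_performance() {
    local root=${1:-/sys/devices/system/cpu} f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue
        echo performance > "$f"
    done
}
</syntaxhighlight>

The frequencies are reported in kHz, so a core at the 2400 MHz maximum reads <code>2400000</code>. Recording <code>governor_report</code> output at hang time is what lets a hang be tied to a pinned frequency rather than assumed.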

=== 3. EEE / TSO / GSO off + larger rings — did NOT prevent it on our kernel ===

This is the siderolabs #91 prevention bundle:

<syntaxhighlight lang="bash">
ethtool --set-eee eth0 eee off ; ethtool --set-eee eth0 advertise 0x0
ethtool -K eth0 tso off gso off
ethtool -G eth0 rx 4096 tx 2048
</syntaxhighlight>

We applied and verified all of it (confirmed active at the time of a subsequent hang:
EEE disabled, both offloads off, rings 4096/2048). '''The node hung anyway.''' On our
kernel this bundle did not prevent the stall. We left it applied (it is close to free
on our hardware — see the EEE note above) but we '''cannot report it as effective
prevention'''. Reporters on other kernels (notably pre-#7340 builds) found it stable on
a fleet, so results appear to be kernel-dependent; do not assume it will hold on
<code>linux-raspi</code> 7.0.

If you do apply it, note that a link down/up '''resets''' these settings, so any
watchdog that bounces the interface must re-assert them afterward.
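
One way to make the re-assertion and its verification repeatable is sketched below. The helper names <code>apply_bundle</code> and <code>offloads_off</code> are hypothetical; the <code>ethtool</code> commands are the ones shown above, and the check parses the feature list printed by <code>ethtool -k</code>.

<syntaxhighlight lang="bash">
# apply_bundle IFACE: the settings above as one re-runnable step (needs root).
apply_bundle() {
    ethtool --set-eee "$1" eee off
    ethtool --set-eee "$1" advertise 0x0
    ethtool -K "$1" tso off gso off
    ethtool -G "$1" rx 4096 tx 2048
}

# offloads_off reads `ethtool -k` output on stdin.
# Exit status 0 only if TSO and GSO are both reported as off.
offloads_off() {
    awk -F': ' '
        $1 ~ /tcp-segmentation-offload$/     { tso = $2 }
        $1 ~ /generic-segmentation-offload$/ { gso = $2 }
        END { exit !(tso ~ /^off/ && gso ~ /^off/) }'
}

# After any link bounce:
#   apply_bundle eth0
#   ethtool -k eth0 | offloads_off || echo "offloads were not re-applied" >&2
#   ethtool -g eth0            # confirm ring sizes
#   ethtool --show-eee eth0    # confirm EEE state
</syntaxhighlight>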

=== 4. Backported <code>macb.ko</code> (the actual fix) — deployed, result pending ===

Because the upstream fix is a driver change and <code>macb</code> is a loadable module
(<code>CONFIG_MACB=m</code>) on the Ubuntu kernel, the three #7340 patches can be
backported and the driver rebuilt '''without replacing the kernel'''. The patches
applied cleanly to the <code>linux-raspi</code> 7.0 source. This is the only mitigation
that targets the root cause rather than recovering from or trying to avoid it.

We load it '''live and non-persistently''': the rebuilt module lives outside
<code>/lib/modules</code> and the initramfs and is inserted at runtime, so '''every
reboot returns to the stock driver automatically'''. With root on NVMe (so the NIC
driver is not boot-critical) this is low-risk — a bad module cannot wedge boot, because
boot never loads it. A loader script does the swap (verify the module's vermagic
matches the running kernel, <code>rmmod</code> stock → <code>insmod</code> patched →
bring the link up → wait for a real off-host ping → roll back to stock on any failure),
with the watchdog as the final backstop.
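
The swap sequence above could look roughly like this. It is a sketch under assumptions, not our loader: the module path, the ping target (a 192.0.2.0/24 documentation placeholder), the retry count, and all function names are hypothetical.

<syntaxhighlight lang="bash">
# Live driver swap sketch. PATCHED_KO, IFACE, and PING_TARGET are placeholders.
PATCHED_KO="${PATCHED_KO:-/opt/macb-patched/macb.ko}"  # hypothetical, outside /lib/modules
IFACE="${IFACE:-eth0}"
PING_TARGET="${PING_TARGET:-192.0.2.1}"                # placeholder off-host address

# vermagic_matches VERMAGIC RELEASE: the first vermagic field is the kernel release.
vermagic_matches() {
    [ "${1%% *}" = "$2" ]
}

# wait_reachable: bring the link up and make up to 30 attempts at an off-host ping.
wait_reachable() {
    local i
    for i in $(seq 1 30); do
        ip link set "$IFACE" up 2>/dev/null
        ping -c 1 -W 1 "$PING_TARGET" >/dev/null 2>&1 && return 0
        sleep 1
    done
    return 1
}

# rollback: return to the stock driver from /lib/modules.
rollback() {
    rmmod macb 2>/dev/null
    modprobe macb
    wait_reachable
}

swap_driver() {
    vermagic_matches "$(modinfo -F vermagic "$PATCHED_KO")" "$(uname -r)" || {
        echo "vermagic mismatch, leaving stock driver in place" >&2
        return 1
    }
    rmmod macb || return 1
    insmod "$PATCHED_KO" && wait_reachable && return 0
    echo "patched driver failed, rolling back to stock" >&2
    rollback
    return 1
}
</syntaxhighlight>

Run something like this detached from any network session (for example from a local console or a service unit), since the connection carrying the command disappears when the stock module is removed.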

Two operational warnings if you try this:

* '''A live <code>macb</code> reload destroys and recreates <code>eth0</code>.''' If you run an overlay network (we run flannel/VXLAN), the overlay device is bound to the old interface and may '''not rebuild itself''' — leaving the host with full LAN connectivity but no overlay (cross-node traffic silently fails). The fix is to restart the networking layer that owns the overlay after the swap. This bites both the swap and a manual <code>rmmod; modprobe</code> revert.
* '''The module is bound to one exact kernel version''' (vermagic). Any kernel upgrade obsoletes it and it must be rebuilt and re-swapped. A reboot after a kernel upgrade but before a re-swap simply runs the stock driver — safe, just unprotected. If you make this permanent, '''DKMS''' is the right vehicle because it auto-rebuilds the module on each kernel bump. We have not done this yet — it is gated on the soak result below.

The module loads tainted (out-of-tree + unsigned, taint mask <code>12288</code>), as
expected.
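
The mask decomposes as 4096 + 8192, which are bit 12 (out-of-tree module loaded, flag <code>O</code>) and bit 13 (unsigned module loaded, flag <code>E</code>) in the kernel's taint table. A small sketch for decoding any mask, with <code>decode_taint</code> as a hypothetical helper name:

<syntaxhighlight lang="bash">
# decode_taint MASK -> prints one "bit N" line per set bit, lowest first.
decode_taint() {
    local mask=$1 bit=0
    while [ "$mask" -gt 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then echo "bit $bit"; fi
        mask=$((mask >> 1))
        bit=$((bit + 1))
    done
}

# Usage:
#   decode_taint "$(cat /proc/sys/kernel/tainted)"
#   decode_taint 12288    # prints "bit 12" and "bit 13"
</syntaxhighlight>

Any bit other than 12 and 13 in the output would indicate a taint source unrelated to the module swap and is worth investigating separately.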

==== Early result (preliminary, not a conclusion) ====

We deployed the patched driver to two nodes first and left a third on the stock driver
as a control. In the following window the '''stock node hung twice''' overnight while
the '''two patched nodes accumulated roughly 25 node-hours with no hang'''. That is a
real differential and is encouraging, but it is a '''small sample''' and the control
node was subsequently patched as well (it had started hanging, removing its value as a
"never-hung" control). '''We are not yet claiming the patch works.''' The standing
test is a multi-week soak across all nodes against the pre-patch hang rate.

== How to confirm you have this specific bug ==

After a recovery, examine the boot in which the hang occurred: <code>journalctl -b -1</code>
if the node was rebooted, <code>journalctl -b 0</code> if a link bounce recovered it. The
commands below assume a reboot:

<syntaxhighlight lang="bash">
ethtool -i eth0                                                       # driver should be 'macb'
journalctl -b -1 | grep -iE 'Link is Down|carrier'                    # expect NOTHING for eth0 (link never dropped)
journalctl -b -1 | grep -iE 'i/o timeout|not responding, timed out'   # all egress dead at once
journalctl -b -1 | grep -icE 'oom-kill|killed process'                # expect 0 (rules out memory)
journalctl -b -1 | grep -icE 'rcu.*stall'                             # we saw 0 (the #2133877 precursor)
# Lists nonzero counters only; no error/drop counter should appear. The counters
# start from zero at boot, so this is only meaningful if run before rebooting.
ethtool -S eth0 | grep -vE ': 0$'
</syntaxhighlight>

If you can catch it '''live''' (before resetting the interface), the conclusive check
is to sample the frame counters a second apart and confirm TX is frozen while RX moves:

<syntaxhighlight lang="bash">
for i in 1 2 3; do
    ethtool -S eth0 | grep -E 'tx_frames|rx_frames'; echo ---; sleep 1
done
# tx_frames identical across samples + rx_frames increasing = the silent TX stall
</syntaxhighlight>

== Status and open questions (2026-06-29) ==

* '''Confirmed:''' the failure is a silent TX stall (captured: TX frozen, RX live), it is not memory/thermal/PHY/cable, and it recovers on interface reset.
* '''Confirmed not the cause for us:''' EEE/LPI (counters zero); frequency transitions as a ''necessary'' trigger (hang occurs at pinned max frequency); the RCU-stall precursor from #2133877 (never observed).
* '''Did not prevent it (our kernel):''' <code>performance</code> governor; EEE/TSO/GSO-off + larger rings.
* '''Works:''' the reachability watchdog (recovery, not prevention).
* '''Pending:''' whether the backported #7340 driver eliminates the hang — multi-week soak in progress; persistence via DKMS deferred until that passes.
* '''The real fix''' remains an Ubuntu SRU of #7340 into <code>linux-raspi</code>; track Launchpad #2133877 for ''Fix Released''.

== References ==

* raspberrypi/linux #7339 — silent TX stall on Pi 5 / CM5 (root-cause issue): [https://github.com/raspberrypi/linux/issues/7339 github.com/raspberrypi/linux/issues/7339]
* raspberrypi/linux PR #7340 — <code>net: macb: candidate fixes for silent TX stall on BCM2712/RP1</code> (merged to <code>rpi-6.18.y</code> 2026-05-08): [https://github.com/raspberrypi/linux/pull/7340 github.com/raspberrypi/linux/pull/7340] · netdev series: [https://lore.kernel.org/netdev/cover.1777064117.git.lukasz@raczylo.com/ lore.kernel.org]
* Ubuntu <code>linux-raspi</code> bug #2133877 — Complete network hang on Pi 5 (watch for ''Fix Released''): [https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 bugs.launchpad.net/…/2133877]
* siderolabs/sbc-raspberrypi #91 — silent network death on Pi 5 (Talos): 3-trigger decomposition + the EEE/TSO/GSO/rings bundle: [https://github.com/siderolabs/sbc-raspberrypi/issues/91 github.com/siderolabs/sbc-raspberrypi/issues/91]
* raspberrypi/linux #6420 — Pi 5 sometimes has no LAN after boot (a ''different'', boot-time variant): [https://github.com/raspberrypi/linux/issues/6420 github.com/raspberrypi/linux/issues/6420]
* Intel e1000e "Detected Hardware Unit Hang" — origin of the unrelated "disable TSO" advice (different NIC/driver, does not transfer): [https://forum.proxmox.com/threads/e1000-driver-hang.58284/ forum.proxmox.com/…/e1000-driver-hang]
