Cyclic hardware reset for e1000e

Per Oberg pero at
Mon Feb 18 14:08:55 CET 2019

----- On 18 Feb 2019, at 13:43, Jan Kiszka jan.kiszka at wrote:

> On 18.02.19 13:36, Per Oberg via Xenomai wrote:
> > Hello list

> > I have this issue where my e1000e network card gets into some kind of cyclic
> > hardware reset during operation. The weird thing is that this only happens when
> > I let systemd start the application. If it's started manually it always works
> > as intended.

> > I am running Xenomai 3.0.7 with a linux-4.9.38 kernel and I use the network
> > connection in Linux non-RT mode. I use systemd and NetworkManager.

> > I do realize that once I get into the reset it will continue resetting because I
> > keep flooding the buffers. My issue is that it -never- happens when I start my
> > process manually, only when systemd starts it. Because the network goes down
> > quite badly I cannot log in and disable the service once it happens and
> > therefore I cannot really try starting it manually after letting the network
> > recover.

> > There is some information from Intel in [1] below. There is talk about a power
> > management function, the EEPROM, etc. They specifically write:

> > "82573(V/L/E) TX Unit Hang Messages
> > Several adapters with the 82573 chipset display "TX unit hang" messages during
> > normal operation with the e1000 driver. The issue appears both with TSO enabled
> > and disabled, and is caused by a power management function that is enabled in
> > the EEPROM. Early releases of the chipsets to vendors had the EEPROM bit that
> > enabled the feature. After the issue was discovered newer adapters were
> > released with the feature disabled in the EEPROM."

> > I also read something about disabling GRO/TSO/GSO that helped some people.

> > My questions to the list are:

> > 1. Do you guys have any experience with this?
> > 2. Would I be better off using the RTnet drivers?
> > 3. What could cause the issue to trigger only when run by systemd? (I thought
> > about timing issues and NetworkManager, but how do I debug this?)

>> [1]
> >

> > Thoughts anyone?

> Are you giving Linux enough time to work (no 100% RT domination of any core for
> hundreds of milliseconds or longer)?

I am not sure yet. I have a logging function that reports back to me when I lose samples. Losing samples currently makes the software try to catch up, and this means 100% CPU until it does. I do see this being logged around the time the card resets, but I'm not sure if it's much worse than "usual". If for some reason the hardware reset happens because Linux gets starved, I can easily see this going cyclic.
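
To put a rough number on the catch-up behaviour described above, here is a minimal back-of-the-envelope sketch. It assumes samples arrive at a fixed period and each one costs a fixed amount of CPU time to process; the function name and the figures used (1 ms period, 0.5 ms per sample) are hypothetical and not measured on my system. It only estimates how long a core would stay fully busy while draining a backlog, which is the kind of window Jan is asking about.

```python
def catch_up_time_ms(backlog, period_ms, cost_ms):
    """Estimate how long a core stays fully busy while draining a backlog.

    backlog   -- samples already missed when catch-up starts (hypothetical)
    period_ms -- interval at which new samples keep arriving
    cost_ms   -- CPU time needed to process one sample
    """
    if cost_ms >= period_ms:
        # Processing is no faster than arrival, so the backlog never drains.
        return float("inf")
    # While catching up, samples are handled at 1/cost_ms and arrive at
    # 1/period_ms, so the backlog shrinks at the difference of the two rates.
    drain_rate = 1.0 / cost_ms - 1.0 / period_ms
    return backlog / drain_rate


if __name__ == "__main__":
    # Hypothetical figures: 1 ms sample period, 0.5 ms processing cost.
    for backlog in (10, 100, 1000):
        busy = catch_up_time_ms(backlog, period_ms=1.0, cost_ms=0.5)
        print(f"backlog={backlog:5d} samples -> core busy for {busy:.1f} ms")
```

With these assumed figures, a backlog of 100 samples would already keep the core saturated for about 100 ms, so a larger backlog could easily reach the "hundreds of milliseconds" range mentioned above.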

Per Öberg 
