Kernel panic induced by wireless network usage? #77
I compiled the master branch in a clean Debian VM and ran InstallToInternal.sh, and i believe that wireless usage is inducing kernel panics on a fairly clean install (no DEs, manual wpa association & dhclient). Sometimes it happens during association, other times during usage. I have attached some pictures of the kernel panics. There don't seem to be any logs of these events that i can find. If there's any way i can provide more information, please let me know!
That's no good. Must be due to a regression in the kernel between 4.17.2 and 4.17.19.
Looks like interrupt request handling is what ends up panicking.
For now, you can switch your checkout to commit 6333149282 to use 4.17.2 instead of 4.17.19.
Going through for sanity's sake:
No device tree changes, no changes to the open wifi firmware, dma and ath kernel drivers are mostly unchanged.
Some dma handling changes in dwc2, so I'll test reverting the dwc2 tree.
The cros_ec spi/i2c drivers are unchanged.
There are some changes in the i2c driver, and given the panics I'll test reverting that tree as well.
This commit in the touchpad driver could also be at fault
f1f3d22d65f1e657826f5515b6b6b38728082d9a
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/input/mouse/?h=v4.17.19&id=f1f3d22d65f1e657826f5515b6b6b38728082d9a
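For anyone wanting to try the workaround mentioned above (switching the checkout to commit 6333149282), a minimal sketch might look like the following. The helper name `pin_checkout` and the repository path are assumptions made for illustration; only the commit hash comes from this thread.

```shell
# Hypothetical helper: pin a PrawnOS checkout to a given commit.
# Defaults to the commit named in this thread, which still builds kernel 4.17.2.
pin_checkout() {
    repo_dir="$1"
    commit="${2:-6333149282}"
    # -C runs git inside repo_dir without changing the current directory
    git -C "$repo_dir" checkout "$commit"
}

# Usage (the path is an assumption, adjust to where you cloned the repo):
#   pin_checkout "$HOME/PrawnOS"
```

This leaves the repository in a detached HEAD state; checking out master again returns it to the current branch.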
@tslilc do you happen to see errors similar to these in the kernel logs before the panic, or even just randomly?
To make logging and debugging hangs easier, I do this https://github.com/SolidHal/PrawnOS/wiki/Using-the-debug-usb-uart-serial-on-the-Asus-C201
@SolidHal
Thanks for your continued and amazing work on this project. Already i have a 90% functional libre laptop and couldn't be happier.
Yes, i can confirm that i see these types of errors both randomly and leading up to the panic. The numbers are a little different, but i don't think that's a relevant difference. It seems to happen more often when the USB wifi is plugged into the port closest to the screen.
Thanks for the advice, i'll try to test whether this happens with 4.17.2, thanks!
@tslilc Thank you for the kind words! I hope to make it a 100% functional libre laptop!
Alright, then I am seeing the same issue in my 4.19.15 tests.
I noticed the same, probably a dwc2 oddity.
Here are my steps for reproducing the issue reliably:
Using kernel 4.19.15 with dwc2 and ath trees from 4.17.2 I can reliably download the debian dvd image multiple times. The ath tree alone didn't cut it, and I'm not convinced it is needed so I'll test without it.
One caveat: with the 4.17.2 trees and USB_DWC2_DEBUG set in the kernel config, tons of these messages are thrown and seem to replace the dwc2_hc_chhltd_intr_dma errors seen with later versions of the dwc2 tree.
This points to changes in dwc2 between 4.17.2 and 4.17.19 that make these transaction errors into a more noticeable issue. Unsure if the transaction errors result in data corruption. Testing this.
One note on USB transaction errors is that they are allowable by spec and should not result in data corruption, so if ath and dwc2 are written correctly, transaction errors aren't a huge concern.
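To get a rough feel for how often the transaction errors show up, one could count the lines in a kernel log that mention dwc2_hc_chhltd_intr_dma. This is only a sketch: the helper name `count_dwc2_errors` is made up here, and it assumes the function name appears verbatim in each error line, which may not hold for every kernel version.

```shell
# Hypothetical helper: count kernel log lines mentioning the dwc2 halt handler.
# Reads the log from standard input.
count_dwc2_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the helper usable under "set -e".
    grep -c 'dwc2_hc_chhltd_intr_dma' || true
}

# Usage:
#   sudo dmesg | count_dwc2_errors
#   count_dwc2_errors < saved-dmesg.log
```

Comparing the count between two kernel builds over the same download test gives a crude before/after number.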
One important thing: the driver used with wpa_supplicant.
If I use wext it hangs even with the 4.17.2 dwc2 tree. With nl80211 I get far fewer transaction errors and it doesn't seem to panic.
I haven't managed to reproduce this issue using 4.17.19.
@Anthony-Sensors Huh, are you using the repo as-is or do you have some modifications?
EDIT: Also, are you using the same ath9k download in /build/ that you were using previously to build 4.17.2?
I'm using your release alpha version 6. I'm using wireless on the USB port closest to me. I haven't experienced this issue yet.
Looks like the issue I was actually experiencing was #83. Now that I have that figured out, I can try to figure out why this issue happens.
Sucks when the debug tools have issues.
@tslilc Did you happen to specify a driver with -D when running wpa_supplicant?
I'm finding some correlation between the nl80211 driver and this crash.
@solidhal, i was using wext. Based on regular usage these past few days, nl80211 seems to have mitigated the issue. Thanks!
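For readers following along, selecting the driver backend explicitly might look like the sketch below. The helper name `start_wpa`, the interface name `wlan0`, the config path, and the `DRY_RUN` switch are all assumptions for illustration; the `-B`, `-D`, `-i` and `-c` options are standard wpa_supplicant flags.

```shell
# Hypothetical helper: start wpa_supplicant with an explicit driver backend.
# Interface name and config path are assumptions; adjust for your system.
start_wpa() {
    driver="${1:-nl80211}"
    iface="${2:-wlan0}"
    conf="${3:-/etc/wpa_supplicant/wpa_supplicant.conf}"
    # -B: run in the background, -D: driver backend,
    # -i: interface, -c: configuration file
    set -- wpa_supplicant -B -D "$driver" -i "$iface" -c "$conf"
    if [ -n "${DRY_RUN:-}" ]; then
        # Only print the command that would be run
        echo "$*"
    else
        sudo "$@"
    fi
}

# Usage:
#   start_wpa              # nl80211 on wlan0
#   start_wpa wext wlan0   # the older wext backend, for comparison
```

Running the same download test once with each backend is one way to check whether the driver choice matters on a given setup.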
@SolidHal, i should say that some time ago i installed a (3.3V! no need for any extra wiring) AR9271 usb WiFi adapter to the webcam connector and i haven't had any issues at all -- even with 4.17.19 from your development branch -- on both wext (tested a little) and nl80211 (tested far more). Could this be something about the external USB ports?
@tslilc Yeah, that's part of the reason I didn't notice that this issue has existed since the 4.17.2 releases. The bug is due to how the dwc2 drivers, which handle the USB ports, and the ath9k devices interact.
I've been debugging it when I have the free time, but it's slow going.
@SolidHal i see. Well i'm certainly grateful for your continued efforts!
Unfortunately, for the time being, i think this sort of hardware hacking and debugging is somewhat above my head.
I think I've completed this chase. Moving ipv6 back in to the kernel instead of building it as a module seems to fix this. The other issues I was experiencing seem to be a bug in enabling the dwc2 periodic debug and SOF debugging, which is annoying.
With the image I'm about to push as a release I was able to download files continuously overnight using the chromium browser from debian unstable:
apt install -t unstable chromium
I chose to use this over firefox-esr as all of the available firefox-esr builds are still buggy in weird places that I don't want to dig into right now.
I also set all of the sleep and display turn off sliders to never in the settings.
@SolidHal thanks for your hard work on this. Unfortunately i’m travelling right now (with the c201) and so don’t have access to an external USB WiFi device to test. I’ll be sure to try it when i’m back though. Thanks again!
I believe I've gotten the same problem as @tslilc. I was trying your Alpha 9 release, with XFCE. Clean install, resize, reboot and try to associate to wifi on first login. System completely freezes and needs a hard shutdown. On reboot everything seems to work, can open apps, mount hard drives. I'll try building an image based on 6333149282, as you suggested.
@robinde, could you share what brand/model of AR9271 dongle you have?
I haven't been able to recreate these crashes on version 9 unfortunately, probably just getting lucky. I did come across two ARM CPU errata that the Chrome OS team tried to get mainlined that fix hangs on the rk3288: https://patchwork.kernel.org/patch/10909833/ and https://patchwork.kernel.org/patch/10909835/.
Maybe the wireless device is causing the specific cpu states that they refer to?
I've pulled them in and moved up to 4.19.53 in the latest release. I also disabled most power management to see if that is causing it.
I tested with what will be alpha version 10 for 7 hours, and haven't had a crash yet (although that's not any different than my experience with the previous version).
When version 10 finishes uploading, could you guys test it out when you have a chance? @tslilc @robinde @ifbizo
My test process for anyone that is interested is:
ping <some ip>
As the debian downloads finish, delete them and queue them up again.
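A scripted approximation of that test process could look like the following. It is a sketch, not the procedure actually used in this thread: the downloads here were done in a browser, and the helper name `soak_test`, the use of wget, and the file names are assumptions.

```shell
# Hypothetical soak test: keep the wireless link busy by pinging a host
# while repeatedly downloading a large file and deleting it again.
soak_test() {
    host="$1"          # IP or hostname to ping
    url="$2"           # URL of a large file, e.g. a Debian image
    rounds="${3:-10}"  # how many download/delete cycles to run

    # Keep a ping running in the background for the whole test
    ping "$host" > ping.log 2>&1 &
    ping_pid=$!

    i=1
    while [ "$i" -le "$rounds" ]; do
        wget -q -O soak-download.tmp "$url" || echo "round $i: download failed" >&2
        rm -f soak-download.tmp
        i=$((i + 1))
    done

    # Stop the background ping once the downloads are done
    kill "$ping_pid" 2>/dev/null || true
    wait "$ping_pid" 2>/dev/null || true
}

# Usage (host and URL are placeholders):
#   soak_test <some ip> <url of a large file> 20
```

If the machine hangs partway through, the last lines of ping.log give a rough timestamp for when the link stopped responding.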
I'm not sure if this is helpful, but during the InstallPackages script, with the AR9271 plugged in, I got this panic:
My main issue though continues to be #95 even with Alpha 10, even without the ar9271 plugged in. I'm not convinced there isn't a local hardware issue with my machine.
@tehbra1n Yes it is! Thank you. If you finished the install after that, was the wireless working?
That seems to be a different panic than what tslilc was experiencing, so definitely interesting...
If it happens again, and the system is usable, can you capture the output of sudo dmesg and upload it here?
After fixing my trackpad I moved on to alpha 11 with no repeat of that kind of panic.
I tested out Alpha 11 this weekend, seems like things are working really well. I was using the TPE-N150USB, which has an AR9271 chipset.
No crashes, good throughput, seems stable.
This issue and #102 refer to the same problem. Since this one is a bit older, and many of the logs predate quite a few fixes I'm going to close this one and keep #102 which contains more recent logs.
Please post any updates to #102.