Kernel panic induced by wireless network usage? #77

Closed
opened 2019-01-19 02:49:01 +01:00 by tslilc · 27 comments
tslilc commented 2019-01-19 02:49:01 +01:00 (Migrated from github.com)

I compiled the master branch in a clean Debian vm, ran `InstallToInternal.sh`, and i believe that wireless usage is inducing kernel panics on a fairly clean install (no DEs, manual wpa association & dhclient). Sometimes it happens during association, other times during usage. I have attached some pictures of the kernel panics.

There don't seem to be any logs that i can find of these events. If there's any way i can provide more information please let me know!

![img-20190118-202751](https://user-images.githubusercontent.com/1778139/51420789-657e5080-1b63-11e9-8f61-72e8b652e3dd.jpg)
![kp2](https://user-images.githubusercontent.com/1778139/51420790-657e5080-1b63-11e9-8bcc-af0af615c8d8.jpg)
![kp3](https://user-images.githubusercontent.com/1778139/51420791-657e5080-1b63-11e9-8469-766f498ea32a.jpg)
SolidEva commented 2019-01-22 15:58:48 +01:00 (Migrated from github.com)

That's no good. Must be due to a regression in the kernel between 4.17.2 and 4.17.19.

Looks like interrupt request handling is what ends up panicking.

For now, you can switch your checkout to commit 633314928242979de3c6f315eefc86aef134b91f to use 4.17.2 instead of 4.17.19.
SolidEva commented 2019-01-22 16:46:14 +01:00 (Migrated from github.com)

Going through for sanity's sake:
No device tree changes, no changes to the open wifi firmware, dma and ath kernel drivers are mostly unchanged.

Some dma handling changes in dwc2, so I'll test reverting the dwc2 tree.
The cros_ec spi/i2c drivers are unchanged.

There are some changes in the i2c driver, and given the panics I'll test reverting that tree as well.

SolidEva commented 2019-01-22 17:38:13 +01:00 (Migrated from github.com)

This commit in the touchpad driver could also be at fault
f1f3d22d65f1e657826f5515b6b6b38728082d9a

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/input/mouse/?h=v4.17.19&id=f1f3d22d65f1e657826f5515b6b6b38728082d9a

SolidEva commented 2019-01-23 21:40:14 +01:00 (Migrated from github.com)

@tslilc

Do you happen to see errors similar to these in the kernel logs before the panic, or even just randomly?

```
dwc2 f72c0000.usb: dwc2_hc_chhltd_intr_dma: Channel 0 - ChHltd set, but reason is unknown
dwc2 f72c0000.usb: hcint 0x00000002, intsts 0x06200029

dwc2 f72c0000.usb: dwc2_hc_chhltd_intr_dma: Channel 11 - ChHltd set, but reason is unknown
dwc2 f72c0000.usb: hcint 0x00000002, intsts 0x04200029
```

To make logging and debugging hangs easier, I use the debug USB UART serial connection described here: https://github.com/SolidHal/PrawnOS/wiki/Using-the-debug-usb-uart-serial-on-the-Asus-C201
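For anyone checking their own logs for the errors quoted above, something like the following might help. It is only a sketch: the function names `count_chhltd` and `chhltd_channels` are made up for illustration, and it assumes the message arrives as a single line in the kernel log.

```shell
# Hypothetical helpers (names are illustrative, not part of PrawnOS).
# Both read kernel log text on stdin.

# Print how many "ChHltd set" lines from dwc2 appear in the input.
count_chhltd() {
    # grep -c exits non-zero when the count is 0, so mask that exit status.
    grep -c 'dwc2_hc_chhltd_intr_dma: Channel [0-9]* - ChHltd set' || true
}

# Print the distinct host channel numbers that reported the error, one per line.
chhltd_channels() {
    sed -n 's/.*dwc2_hc_chhltd_intr_dma: Channel \([0-9][0-9]*\) - ChHltd set.*/\1/p' | sort -n -u
}

# Example usage on a running system (not executed here):
#   sudo dmesg | count_chhltd
#   sudo dmesg | chhltd_channels
```

A non-zero count before a hang would suggest the same dwc2 behaviour discussed in this thread, though it does not prove the panic has the same cause.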
tslilc commented 2019-01-24 00:50:40 +01:00 (Migrated from github.com)

@SolidHal

Thanks for your continued and amazing work on this project. Already i have a 90% functional libre laptop and couldn't be happier.

Yes, i can confirm that i see these types of errors both randomly and leading up to the panic. The numbers are a little different, but i don't think that's a relevant difference. It seems to happen more often when the USB wifi is plugged into the port closest to the screen.

Thanks for the advice, i'll try to test whether this happens with 4.17.2, thanks!

SolidEva commented 2019-01-24 02:46:06 +01:00 (Migrated from github.com)

@tslilc Thank you for the kind words! I hope to make it a 100% functional libre laptop!

Alright, then I am seeing the same issue in my 4.19.15 tests.

> It seems to happen more often when the USB wifi is plugged into the port closest to the screen

I noticed the same, probably a dwc2 oddity.

SolidEva commented 2019-01-24 03:14:32 +01:00 (Migrated from github.com)

Here are my steps for reproducing the issue reliably:

  1. Connect to a wireless network.
  2. If that doesn't cause the issue, download a large file, like a 4GB Debian DVD image.

Using kernel 4.19.15 with dwc2 and ath trees from 4.17.2 I can reliably download the debian dvd image multiple times. The ath tree alone didn't cut it, and I'm not convinced it is needed so I'll test without it.

One caveat: with the 4.17.2 trees and `USB_DWC2_DEBUG` set in the kernel config, tons of these messages are thrown:

```
[   59.285317] dwc2 ff540000.usb: --Host Channel 7 Interrupt: Transaction Error--
[  162.662343] dwc2 ff540000.usb: --Host Channel 8 Interrupt: Transaction Error--
[  163.035968] dwc2 ff540000.usb: --Host Channel 15 Interrupt: Transaction Error--
[  173.322006] dwc2 ff540000.usb: --Host Channel 5 Interrupt: Transaction Error--
[  174.491225] dwc2 ff540000.usb: --Host Channel 14 Interrupt: Transaction Error--
[  180.492403] dwc2 ff540000.usb: --Host Channel 4 Interrupt: Transaction Error--
```

They seem to replace the `dwc2_hc_chhltd_intr_dma` errors seen with later versions of the dwc2 tree.

This points to changes in dwc2 between 4.17.2 and 4.17.19 that turn these transaction errors into a more noticeable issue. I'm unsure whether the transaction errors result in data corruption. Testing this.

One note on USB transaction errors: they are allowed by the spec and should not result in data corruption, so if ath and dwc2 are written correctly, transaction errors aren't a huge concern.
SolidEva commented 2019-01-26 02:41:30 +01:00 (Migrated from github.com)

One important thing: the driver used with wpa_supplicant.
If I use `wext` it hangs even with the 4.17.2 dwc2 tree. With `nl80211` I get far fewer transaction errors and it doesn't seem to panic.
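For reference, the wpa_supplicant driver is selected with `-D`. A rough sketch of switching between the two drivers follows; the function name `start_wpa`, the `RUN` override, the interface name `wlan0` and the config path are all assumptions for illustration, not taken from this thread.

```shell
# Hypothetical wrapper: builds a wpa_supplicant command line with an explicit driver.
# By default it only prints the command; set RUN=sudo to actually execute it.
start_wpa() {
    driver="${1:-nl80211}"                               # nl80211 or wext
    iface="${2:-wlan0}"                                  # assumed interface name
    conf="${3:-/etc/wpa_supplicant/wpa_supplicant.conf}" # assumed config path
    # -B: run in background, -D: driver, -i: interface, -c: config file
    ${RUN:-echo} wpa_supplicant -B -D "$driver" -i "$iface" -c "$conf"
}

# Example usage (not executed here):
#   start_wpa nl80211            # print the command for the nl80211 driver
#   RUN=sudo start_wpa wext      # actually start wpa_supplicant with wext
```

Printing the command first makes it easy to confirm which driver is in use before comparing crash behaviour between `wext` and `nl80211`.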
Anthony-Sensors commented 2019-02-17 19:37:39 +01:00 (Migrated from github.com)

I haven't managed to reproduce this issue using 4.17.19.
SolidEva commented 2019-02-19 17:07:48 +01:00 (Migrated from github.com)

@Anthony-Sensors Huh, are you using the repo as-is or do you have some modifications?
EDIT: Also, are you using the same ath9k download in /build/ that you were using previously to build 4.17.2?

Anthony-Sensors commented 2019-02-19 18:40:49 +01:00 (Migrated from github.com)

I'm using your release alpha version 6. I'm using wireless on the USB port closest to me. I haven't experienced this issue yet.
SolidEva commented 2019-02-25 22:05:09 +01:00 (Migrated from github.com)

Looks like the issue I was actually experiencing was #83. Now that I have that figured out, I can try to figure out why this issue happens.

Sucks when the debug tools have issues.

SolidEva commented 2019-03-07 21:53:08 +01:00 (Migrated from github.com)

@tslilc Did you happen to specify a driver with `-D` when running wpa_supplicant?
I'm finding some correlation between the nl80211 driver and this crash.
tslilc commented 2019-03-11 15:38:57 +01:00 (Migrated from github.com)

@solidhal, i was using wext. Based on regular usage these past few days, nl80211 seems to have mitigated the issue. Thanks!
tslilc commented 2019-03-16 18:15:19 +01:00 (Migrated from github.com)

@SolidHal, i should say that some time ago i installed a (3.3V! no need for any extra wiring) AR9271 usb WiFi adapter to the webcam connector and i haven't had any issues at all -- even with 4.17.19 from your development branch -- on both wext (tested a little) and nl80211 (tested far more). Could this be something about the external USB ports?

SolidEva commented 2019-03-17 03:00:47 +01:00 (Migrated from github.com)

@tslilc Yeah, that's part of the reason I didn't notice that this issue has existed since the 4.17.2 releases. The bug is due to how the dwc2 drivers, which handle the USB ports, and the ath9k devices interact.
I've been debugging it when I have the free time, but it's slow going.
tslilc commented 2019-03-18 21:34:09 +01:00 (Migrated from github.com)

@SolidHal i see. Well i'm certainly grateful for your continued efforts!

Unfortunately, for the time being, i think this sort of hardware hacking and debugging is somewhat above my head.

SolidEva commented 2019-05-22 17:52:13 +02:00 (Migrated from github.com)

I think I've completed this chase. Moving ipv6 back in to the kernel instead of building it as a module seems to fix this. The other issues I was experiencing seem to be a bug in enabling the dwc2 periodic debug and SOF debugging, which is annoying.
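To check whether a given kernel config builds ipv6 in or as a module, something along these lines could be used. This is only a sketch: the helper name `kconfig_mode` is hypothetical, and the path to the `.config` file depends on where the kernel was built.

```shell
# Hypothetical helper: print how a kernel option is set in a .config file.
# Prints y (built in), m (module), or n (not set or absent).
kconfig_mode() {
    option="$1"   # e.g. CONFIG_IPV6
    config="$2"   # path to the kernel .config (assumed location)
    # Match only the exact option name followed by =y or =m.
    value=$(sed -n "s/^${option}=\([ym]\)\$/\1/p" "$config" | head -n 1)
    echo "${value:-n}"
}

# Example usage (not executed here; the path is an assumption):
#   kconfig_mode CONFIG_IPV6 /path/to/linux/.config
```

With the change described above, the expected result for `CONFIG_IPV6` would be `y` rather than `m`.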

With the image I'm about to push as a release, I was able to download files continuously overnight using the Chromium browser from Debian unstable (`apt install -t unstable chromium`). I chose this over firefox-esr because all of the available firefox-esr builds are still buggy in weird places that I don't want to dig into right now.

I also set all of the sleep and display turn-off sliders to never in the settings.
tslilc commented 2019-05-27 14:24:06 +02:00 (Migrated from github.com)

@SolidHal thanks for your hard work on this. Unfortunately i’m travelling right now (with the c201) and so don’t have access to an external USB WiFi device to test. I’ll be sure to try it when i’m back though. Thanks again!

ergofroggy commented 2019-06-21 09:52:30 +02:00 (Migrated from github.com)

I believe I've hit the same problem as @tslilc. I was trying your Alpha 9 release, with XFCE: clean install, resize, reboot, and try to associate to wifi on first login. The system completely freezes and needs a hard shutdown. On reboot everything seems to work; I can open apps and mount hard drives. I'll try building an image based on 633314928242979de3c6f315eefc86aef134b91f, as you suggested.
SolidEva commented 2019-06-21 17:22:53 +02:00 (Migrated from github.com)

@robinde, could you share what brand/model of AR9271 dongle you have?
SolidEva commented 2019-06-23 00:44:53 +02:00 (Migrated from github.com)

I haven't been able to recreate these crashes on version 9 unfortunately, probably just getting lucky. I did come across two ARM CPU errata that the Chrome OS team tried to get mainlined that fix hangs on the rk3288: https://patchwork.kernel.org/patch/10909833/ and https://patchwork.kernel.org/patch/10909835/.

Maybe the wireless device is causing the specific CPU states that they refer to?
I've pulled them in and moved up to 4.19.53 in the latest release. I also disabled most power management to see if that is causing it.

I tested with what will be alpha version 10 for 7 hours, and haven't had a crash yet (although that's not any different from my experience with the previous version).

When version 10 finishes uploading, could you guys test it out when you have a chance? @tslilc @robinde @ifbizo

My test process for anyone that is interested is:

  1. Boot up the laptop
  2. Log in to the xfce install
  3. Plug in the ar9271
  4. Use the xfce network manager to connect to a network
  5. Open firefox, stream video on autoplay
  6. In another tab, download all three parts of a debian 9 amd64 release
  7. In a terminal, run `ping <some ip>`

As the debian downloads finish, delete them and queue them up again.
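The repeated-download part of this test process could also be scripted instead of driven from the browser. The following is a minimal sketch, assuming `wget` is installed; the function name `soak_download`, the `FETCH` override and the placeholder URL are hypothetical and not part of the original test setup.

```shell
# Hypothetical soak test: fetch the same URL several times, discarding the data,
# and stop at the first failed round.
soak_download() {
    url="$1"
    rounds="${2:-3}"   # number of download rounds, default 3
    i=1
    while [ "$i" -le "$rounds" ]; do
        # FETCH can be overridden for dry runs; by default wget writes to /dev/null,
        # so there is nothing to delete between rounds.
        if ${FETCH:-wget -q -O /dev/null} "$url"; then
            echo "round $i ok"
        else
            echo "round $i failed"
            return 1
        fi
        i=$((i + 1))
    done
}

# Example usage (not executed here; replace the placeholder with a real image URL):
#   soak_download "<url of a large debian image>" 10
```

Running `ping <some ip>` in a second terminal alongside this keeps the rest of the test process the same.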
tehbra1n commented 2019-06-23 20:34:41 +02:00 (Migrated from github.com)

I'm not sure if this is helpful, but during the InstallPackages script, with the ar9271 plugged in, I got this panic:

![image](https://user-images.githubusercontent.com/17303996/59980410-986dba00-95c3-11e9-97f1-42dcef58a0cf.png)

My main issue though continues to be #95 even with Alpha 10, even without the ar9271 plugged in. I'm not convinced there isn't a local hardware issue with my machine.
SolidEva commented 2019-06-24 19:30:05 +02:00 (Migrated from github.com)

@tehbra1n Yes it is! Thank you. If you finished the install after that, was the wireless working?
That seems to be a different panic than what tslilc was experiencing, so definitely interesting...

If it happens again, and the system is usable, can you capture the output of `sudo dmesg` and upload it here?
tehbra1n commented 2019-06-27 15:47:59 +02:00 (Migrated from github.com)

> @tehbra1n Yes it is! Thank you. If you finished the install after that, was the wireless working?
> That seems to be a different panic than what tslilc was experiencing, so definitely interesting...
>
> If it happens again, and the system is usable can you capture the output of `sudo dmesg` and upload it here?

After fixing my trackpad I moved on to alpha 11 with no repeat of that kind of panic.
ergofroggy commented 2019-07-01 10:59:49 +02:00 (Migrated from github.com)

I tested out Alpha 11 this weekend, seems like things are working really well. I was using the TPE-N150USB, which has an AR9271 chipset.
No crashes, good throughput, seems stable.

SolidEva commented 2019-07-12 22:44:48 +02:00 (Migrated from github.com)

This issue and #102 refer to the same problem. Since this one is a bit older, and many of the logs predate quite a few fixes I'm going to close this one and keep #102 which contains more recent logs.

Please post any updates to #102.

Reference
ev4/PrawnOS#77