Random Core Panics only on some LoPy4 units



  • Hello everyone:

    First time posting here, but I've been working with Pycom and the LoPy4 for months now. We are using the LoPy4 in diesel pumps to log telemetry data and send it via LoRaWAN to a central gateway. The LoPys are enclosed in proprietary IP67 cases made of plastic and anodized aluminum. They operate in a hot environment: the exterior of the case varies between 50°C and 100°C. We do not know the inside temperature, as we are not logging it.

    We successfully operated about 10 of these units for about 2 months without problems (the units run almost 24/7). Then, a couple of weeks ago, the units started resetting constantly, at least once an hour, sometimes more. This makes our units useless in the field. We determined that the cause of the restarts is random core panics, as shown below:

    Core 1 register dump:
    PC      : 0x40114bb0  PS      : 0x00060334  A0      : 0x80085755  A1      : 0x3ffc1360  
    A2      : 0x00000002  A3      : 0x3ffcb728  A4      : 0x00000000  A5      : 0x01e811ab  
    A6      : 0x3ffcc784  A7      : 0x00000001  A8      : 0x80084a0e  A9      : 0x3ffc1340  
    A10     : 0x01e811ab  A11     : 0x3ffc1361  A12     : 0x3ffc1361  A13     : 0x3ffcc7e0  
    A14     : 0x3ffcc7d0  A15     : 0x3ffae270  SAR     : 0x0000000e  EXCCAUSE: 0x00000007  
    EXCVADDR: 0x00000000  LBEG    : 0x4009c146  LEND    : 0x4009c155  LCOUNT  : 0x00000000  
    Core 1 was running in ISR context:
    EPC1    : 0x40084507  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x40114bb0
    
    Backtrace: 0x40114bb0:0x3ffc1360 0x40085752:0x3ffc1390 0x4008585d:0x3ffc13c0 0x40084593:0x3ffc13f0 0x40083b2d:0x3ffc1410 0x4006222d:0x3ffdba60 0x4009c061:0x3ffdba80 0x4009c09a:0x3ffdbab0 0x4009c3f1:0x3ffdbae0 0x4008bc08:0x3ffdbb00 0x4008bc36:0x3ffdbb30 0x4011fa37:0x3ffdbb50 0x40120779:0x3ffdbb70 0x4011f2fb:0x3ffdbb90 0x4011f0d7:0x3ffdbc00 0x4010b891:0x3ffdbc40 0x40117633:0x3ffdbc60 0x4010b940:0x3ffdbc90 0x400fc3d5:0x3ffdbcc0 0x400f8bfd:0x3ffdbce0 0x400f8c65:0x3ffdbd00 0x40104212:0x3ffdbd20 0x400fc4e4:0x3ffdbdc0 0x400f8bfd:0x3ffdbdf0 0x400f8c2a:0x3ffdbe10 0x400df1da:0x3ffdbe30 0x400df44d:0x3ffdbed0 0x400de08b:0x3ffdbef0
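
A minimal sketch of how a backtrace line like the one above could be split into its PC:SP pairs, so that the program counter values can be handed to a symbolizer such as xtensa-esp32-elf-addr2line. It assumes you have an ELF file that matches the exact firmware build running on the device; without that, the addresses cannot be resolved to function names. The function names parse_backtrace and addr2line_args are hypothetical helpers, not part of any Pycom tooling.

```python
import re

# Each backtrace frame is printed as PC:SP, both as 8-digit hex values.
BACKTRACE_PAIR = re.compile(r"(0x[0-9a-fA-F]{8}):(0x[0-9a-fA-F]{8})")


def parse_backtrace(line):
    """Return a list of (pc, sp) integer pairs found in a 'Backtrace:' line."""
    return [(int(pc, 16), int(sp, 16)) for pc, sp in BACKTRACE_PAIR.findall(line)]


def addr2line_args(line):
    """Return the PC values as a space-separated string of hex addresses."""
    return " ".join("0x%08x" % pc for pc, _ in parse_backtrace(line))


# First three frames of the backtrace posted above.
sample = "Backtrace: 0x40114bb0:0x3ffc1360 0x40085752:0x3ffc1390 0x4008585d:0x3ffc13c0"
frames = parse_backtrace(sample)
print(addr2line_args(sample))
```

The printed addresses could then be appended to an addr2line invocation that points at the matching ELF file.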
    

    However, the issue is very odd. If we replace the LoPy4 inside a faulty unit with a new one, the new one operates flawlessly. This leads us to believe the LoPy4s in the old units have become faulty, but in a very weird way: aside from the random core panics (Guru Meditation Error), the units operate nominally between resets.

    My question is: has anyone experienced something like this, where some LoPy4s experience many more random core panics than others? Do you think the fact that they operated in a hot environment could have "degraded" the units in this manner?

    FYI: we have ruled out our PCB as the source of the problem. It's the actual LoPy that has become faulty.

    We are using the following firmware version on all of our devices:
    (sysname='LoPy4', nodename='LoPy4', release='1.20.0.rc13', version='v1.9.4-94bb382 on 2019-08-22', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1')
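
A minimal sketch, assuming the release field always has the form major.minor.patch.rcN as in the tuple above, of how the reported release could be turned into a comparable value when checking that every unit in a fleet runs the same firmware. On the device the string would presumably come from os.uname().release; here it is passed in as a plain string so the sketch also runs off-device. parse_release is a hypothetical helper name.

```python
def parse_release(release):
    """Convert a release string such as '1.20.0.rc13' into a tuple of ints.

    Assumes four dot-separated fields, the last one optionally prefixed
    with 'rc'. Other formats raise ValueError.
    """
    parts = release.split(".")
    if len(parts) != 4:
        raise ValueError("unexpected release format: %r" % release)
    major, minor, patch, rc = parts
    if rc.startswith("rc"):
        rc = rc[2:]
    return (int(major), int(minor), int(patch), int(rc))


# Release reported by the devices in this thread.
installed = parse_release("1.20.0.rc13")
# A later release, used here only to show that the tuples compare as expected.
later = parse_release("1.20.2.rc3")
print(installed, installed < later)
```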



  • Dear Dan,

    It is interesting that you are seeing these faults on some devices and not on others. This might indicate something like the PSRAM cache bug related to the ESP32 rev. 1 and rev. 2 chips [1].

    Actually, I don't know if Pycom already ships revised LoPy4 devices that include the ESP32 ECO V3 [2], where these silicon bugs have been mitigated. You might be able to investigate by reading out the chip information [3].

    @d-alvrzx said in Random Core Panics only on some LoPy4 units:

    we have not been able to try your Dragonfly or Squirrel builds neither.

    You should really try them [4]. While we can't promise anything, these builds are already helping others in similar scenarios when accessing the flash memory at runtime.

    With kind regards,
    Andreas.

    [1] https://community.hiveeyes.org/t/random-memory-corruption-faults-on-esp32-wrover-rev-1-and-rev-2-when-running-in-dual-core-mode/2515
    [2] https://www.espressif.com/sites/default/files/documentation/ESP32_ECO_V3_User_Guide__EN.pdf
    [3] https://community.hiveeyes.org/t/read-information-about-the-chip-and-flash-on-pycom-devices-through-the-expansion-board/2699
    [4] https://community.hiveeyes.org/t/squirrel-firmware-for-pycom-esp32/2960



  • Dear @andreas

    Thank you for your prompt response. We actually spoke over on the Hiveeyes forum (see [1]) regarding random core panics, and I sent you 2 .txt files containing the core dumps we had been able to capture.

    a) Since then I have captured other full core dumps from the faulty devices. I have sent them to you via private message on Hiveeyes because I am unable to upload .txt files here :(
    b) No, we are currently using FatFS. We have not yet migrated to LittleFS because you had mentioned Pycom's implementation had a bug.
    c) We write to an SD card approximately every 30 seconds to log data. We also load approximately 30 values from NVRAM at start-up and then sporadically write to NVRAM to update some of these variables.

    I would like to add 2 critical pieces of information:

    1. We have erased the flash of the faulty LoPy4 units with the command esptool.py --port <serial_port> -b 115200 erase_flash, but even after a full erase and flashing the firmware again with the Pycom Firmware Updater, the units remain faulty (i.e. the random core panics continue).
    2. We have attempted to update our firmware to the newest versions released by Pycom (namely 1.20.2.rc3 and 1.20.2.rc6), but this new firmware is not behaving well with our software. We have checked the changelogs for these new versions, but Pycom must have changed something that we have not yet noticed. For this reason, we have not been able to try your Dragonfly or Squirrel builds either.

    Best regards,

    Dan

    [1] https://community.hiveeyes.org/t/investigating-core-panics-on-the-lopy4/2878/31



  • Dear @d-alvrzx,

    thanks for sharing your observations. May I ask you about some more details?

    a) Are you getting any full core dumps from your devices?
    b) Are you using LittleFS already?
    c) Do you read or write anything from/to the filesystem at runtime?

    With kind regards,
    Andreas.

