New GPY - LED Faulty

tlanier

In the last two batches of GPy modules purchased from Arrow Electronics, we received 3 parts (out of 39) on which the LED did not function properly: some of the colors would not work.

Has anyone else seen this problem?

tlanier

@Gijs Sorry for the long delay in responding. I have rewritten our software to eliminate the use of threads and timer interrupts, hoping this would eliminate the crashes. Instead, the crashes still occur, now in different places, and more frequently the processor locks up without printing the backtrace. Last night I was able to capture a crash with a backtrace.

[Screenshot: Capture7-NoThreads.png]

[Screenshot: Capture7-BackTrace.png]

The interesting thing here is that even though I do not import or use the LTE module, the program is crashing inside an lwIP (Lightweight TCP/IP stack) routine. Apparently the TCP/IP software is being loaded and called even though I am not using it in any way.

One idea is now to learn how to rebuild the Pycom OS without including the LWIP stuff. I'm not sure how big a task that would be at this time.

I've also investigated a new product that might have some promise. With some effort I was able to get their product loaded on the GPy. They currently do not have support for the Sequans modem in the GPy. The Toit language is definitely harder to read than Python from my point of view and the UART driver documentation is lacking (although it does support CTS and RTS).

Toit | Cloud-managed containers on the ESP32

My decision point now seems to be:

Try to rebuild the Pycom OS without features that I don't need to try to eliminate the crashes. Since I don't require the TCP/IP stack (we use the Sequans modem TCP/IP stack), threading, or timer interrupts, this should make the code simpler. It is a real shame that there is no hardware reset pin to the modem from the ESP32. I hope this is fixed in some future design of the GPy.
Rewrite my app using the Espressif SDK in C/C++. One hurdle I have doing this is that the hardware design already uses the pins that the JTAG debugger requires. It looks like I would have to live with debugging using print statements. I would encourage Pycom to clearly show hardware designers what pins should not be used if they hope to do JTAG debugging with their design.
Pursue learning Toit and try to write a driver for the GPy's Sequans modem.
Further pursue using generic MicroPython. The initial roadblock I ran into was getting the UART talking to the modem.
Pursue designing an external watchdog timer circuit that would power the GPy down/up after crashes. One challenge to this approach would be to make it field installable. Maybe the watchdog device plugs into the GPy socket and the GPy plugs into the watchdog device. A GPIO pin would be connected internally to allow triggering the device. To be generally useful to current designs, there would need to be some way to jumper/select which currently unused GPIO pin would be used for the trigger.

If anyone out there can think of any better ideas, I'm open to suggestions. Which path is my best chance of success?

Gijs

@tlanier said in New GPY - LED Faulty:

This test also turned up an issue with our program not initializing the mode for checking for registration. I added the modem command "AT+CEREG=0" which disables unsolicited modem responses. Why clearing memory with a different tool affected the mode the modem was in is a mystery.

In version 1.20.2.r2 and later, we switched the CEREG state from 2 to 1 in order to work with the unsolicited modem responses. That should be happening on the firmware side, though, not through erasing the flash.

Concerning the backtrace you decoded: it shows nothing to do with the LTE modem, and it looks like a memory error. Let me share a function that checks all the memory stacks; perhaps the issue is there.

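import gc
import machine
import micropython
import pycom

def memory():
    machine.info()
    # MicroPython garbage-collected heap: free, allocated and total bytes
    print("GC free=", gc.mem_free(), "alloc=", gc.mem_alloc(), "total=", gc.mem_alloc() + gc.mem_free())
    # micropython.mem_info()
    print("mp stack", micropython.stack_use())
    # get_free_heap() returns the free internal and external heap
    print("heap internal", pycom.get_free_heap()[0])
    print("heap external", pycom.get_free_heap()[1])

The imports it needs (gc, machine, micropython and pycom) are included at the top.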

There are other forum users that have attempted to manually create a LTE library, let me link you to their attempt:
https://forum.pycom.io/topic/6202/method-to-take-control-of-cell-modem

Also, let me know if this coredump persists, or if it is just one-off.

tlanier

@Gijs Using the erase all option in the CLI firmware update tool changed the crash location. I have not tried to repeat this, and I even hesitated to post it since I have not thoroughly investigated this behavior. This test also turned up an issue with our program not initializing the mode for checking for registration. I added the modem command "AT+CEREG=0" which disables unsolicited modem responses. Why clearing memory with a different tool affected the mode the modem was in is a mystery.

Instead I moved on to trying some different things. First I tried running generic ESP32 MicroPython on the GPy. It does work but I got stuck trying to talk to the modem. The following code to initialize the uart apparently is not correct. I think the pin assignments are correct, but I could not get any response to a simple "AT" command to the modem.

I know I'm jumping around, but I'm looking for something that I can make work. My next attempt is to use Pycom's OS but strip out all use of threading and timer interrupts. I hope to have this version mostly complete today. This method may turn out to be a good alternative if it works.

Gijs

It's not exactly equal. I'm not sure of the exact differences, but I believe one writes 0xff and the other writes 0x00 to the partitions, and there are some other minor differences. I'm also not sure if it will have the desired effect (i.e., stop the coredumps), but let me know of your results!

tlanier

@Gijs Thanks for the response.

When I program the firmware I've always used the GUI Pycom Firmware Update program under Windows.

I always check all 3 boxes.

(1) Erase during update, (2) RESET: config partition, and (3) RESET: NVS partition

Is this equivalent to the "erase all" that you suggest in the CLI firmware updater tool?

Gijs

You have a lot of questions :), I'll try to get to them all:

I'm not familiar with what tools are available to debug low level on the ESP32. Is there a method to monitor a memory address and trap the code that alters the memory location?

You could try running espcoredump.py. Save the text between '====core dump start===' and '=== core dump end===' to a file (./corefile in the command below), and the tool gives more information about the threads currently running, along with other miscellaneous information. Use the following command in a terminal:

$IDF_PATH/components/espcoredump/espcoredump.py info_corefile -t b64 -c ./corefile ./application.elf
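As a rough sketch of the first step, the text between the two markers can be pulled out of a captured serial log with a few lines of Python on the PC. The helper name and the loose marker matching are assumptions for illustration; they are not part of the Pycom or ESP-IDF tooling.

```python
import re

# Hypothetical helper. The marker patterns are deliberately loose because the
# number of '=' characters and the letter case vary between captures.
START = re.compile(r"=+\s*CORE DUMP START\s*=+", re.IGNORECASE)
END = re.compile(r"=+\s*CORE DUMP END\s*=+", re.IGNORECASE)


def extract_corefile(log_text):
    """Return the base64 text between the core dump markers, or None."""
    start = START.search(log_text)
    if start is None:
        return None
    end = END.search(log_text, start.end())
    if end is None:
        return None  # dump was cut off, e.g. the device froze while printing
    body = log_text[start.end():end.start()]
    # Drop blank lines and stray whitespace added by the terminal program.
    lines = [line.strip() for line in body.splitlines() if line.strip()]
    return "\n".join(lines) + "\n"
```

Writing the returned string to ./corefile gives the input file named in the command above. A None result means the log holds no complete dump.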

Should I move on to the current beta release and try to troubleshoot my program running on it? Would I have a better chance of getting help in resolving the issue? Obviously the developers are working on the new stuff, not the old stuff.

You could indeed check if the issue persists in 1.20.3.b3.

My second test running on a Pymakr board with code removed that would not run on the Pymakr board (no i2c routines talking to our specific hardware) ran for 4 days before it crashed with no backtrace display. Does that mean that the i2c routines accelerate the problem maybe?

I cannot imagine the i2c routine affecting this issue, though who knows...

Is the problem likely within MicroPython? If so should I look at a newer release or should I look at rewriting our code without MicroPython and just using the Espressif SDK?

I'm not exactly sure when or where the telnet library was introduced or why it's causing issues here, though rewriting the code for the ESP-IDF takes the amount of work to a totally different level.

Should I look at rewriting our code using unmodified ESP32 MicroPython without the changes made by Pycom? Has anyone else tried that? Is ESP32 MicroPython widely used outside of Pycom?

I'm not sure what parts you need in the code. Perhaps we can look at custom-compiling a firmware without any trace of the telnet module. I think the only thing you'd need to do is take out https://github.com/pycom/pycom-micropython-sigfox/blob/Dev/esp32/mptask.c#L407, but there might be more complications.

I think MicroPython is at a level of usage similar to arduino-esp32, though I might be mistaken.

Is threading the problem? Should I rewrite our program and try to eliminate the use of threads?

Perhaps the issue is here, perhaps it isn't. I've seen _thread do some odd things, but I have not yet seen it interfere with other threads in this way. They should be defined on the same level as TASK_Servers (which eventually runs the telnet command), next to each other.

Do we need to develop an external hardware watchdog circuit that toggles power on/off if not triggered within a specified time?

If all is well, the device should recover from the coredump, similar to a power reset.

Program crashes in embedded systems are unacceptable.

I'd like to rephrase this to 'Program crashes in embedded systems are unavoidable, take the necessary precautions'. I agree issues like this should not occur (at least not this often), but crashes / issues / glitches are bound to happen, and the way they are handled defines whether they are acceptable.

Perhaps. I'm not sure whether you've already tried this or whether it will help, but using the CLI firmware updater tool you can do an erase all (https://docs.pycom.io/advance/cli/). Or, in the GUI firmware updater tool, clear the NVS and CONFIG partitions and erase the flash. In other cases I wouldn't rule out power glitches, but as you get the same backtrace every time, I want to take this out of the mix.

Still, I'm interested in reproducing the issue, or in hearing from anyone else who has the same issue. I've had devices running for days on end without issues, so I'm also curious as to what could be causing this to happen repeatedly.
Sorry I did not get to all of your questions, but let me know!

tlanier

@Gijs The program crashed again last night at the same backtrace address even with the server disabled.

I assume that at boot the OS is copied from flash to RAM to run. The fact that perfectly valid C code is supposed to be located at that location in memory, and yet an illegal instruction panic is generated, implies that something is overwriting and altering the code at that memory address.

I'm not familiar with what tools are available to debug low level on the ESP32. Is there a method to monitor a memory address and trap the code that alters the memory location?

Should I move on to the current beta release and try to troubleshoot my program running on it? Would I have a better chance of getting help in resolving the issue? Obviously the developers are working on the new stuff, not the old stuff.

My second test running on a Pymakr board with code removed that would not run on the Pymakr board (no i2c routines talking to our specific hardware) ran for 4 days before it crashed with no backtrace display. Does that mean that the i2c routines accelerate the problem maybe?

I'm open to any ideas of how to troubleshoot the problem.

Is the problem likely within MicroPython? If so should I look at a newer release or should I look at rewriting our code without MicroPython and just using the Espressif SDK?

Should I look at rewriting our code using unmodified ESP32 MicroPython without the changes made by Pycom? Has anyone else tried that? Is ESP32 MicroPython widely used outside of Pycom? The Adafruit CircuitPython doesn't even support threading. The ESP32 MicroPython V1.15 documentation concerning _thread states:

This module implements multithreading support.

This module is highly experimental and its API is not yet fully settled and not yet described in this documentation.

_thread docs in MicroPython

Is threading the problem? Should I rewrite our program and try to eliminate the use of threads?

Is there something I might be doing in a timer interrupt that could cause the problem? Currently I do i2c code in the timer interrupt. I have now demonstrated crashes on a Pymakr board without i2c code (not the same backtrace address). There could be multiple problems, not just one. I will try just setting a flag in the timer interrupt and doing the i2c code on one of the main threads as a test.
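The flag approach described above could look roughly like this. It is a minimal sketch, not the actual application code: the DeferredWork class and its method names are hypothetical, and the slow work (for example the I2C reads) is passed in as a plain callable.

```python
class DeferredWork:
    """Keep the timer callback tiny: it only records that work is due."""

    def __init__(self, work):
        self._work = work      # callable doing the slow part (e.g. I2C reads)
        self._pending = False  # set by the callback, cleared by the main loop

    def on_timer(self, _timer=None):
        # Runs in the timer callback context: no I/O here, just set a flag.
        self._pending = True

    def poll(self):
        # Runs from the main loop: do the work only if the timer has fired.
        if not self._pending:
            return False
        self._pending = False
        self._work()
        return True
```

On Pycom firmware, on_timer would presumably be registered as the handler of a periodic Timer.Alarm, with poll() called once per pass of the main loop. Several timer ticks that arrive between two polls collapse into a single run of the work, which is usually acceptable for periodic sampling.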

Is the problem likely within the Espressif SDK? If so, should I look at a newer release of the SDK?

Does anyone have any GPy code running for extended periods without crashing?

Is ESP32 MicroPython a viable tool for embedded applications that should not crash? If not, is that the reason why Pycom's new product uses a STM32H7BQIY6QTR chip and not an ESP32 chip? Has Pycom given up on making the ESP32 product reliable and moved their development resources to a different design?

https://pycom.io/product/f01-h7-oem-module/

Does anyone have ideas on what a MicroPython program can do that would cause an illegal instruction panic? Are there known coding errors that might cause the problem? I'm displaying micropython.mem_info() to look for abnormal memory usage, but I see no clues from that.

Do we need to develop an external hardware watchdog circuit that toggles power on/off if not triggered within a specified time?

What path has the highest probability of success?

Program crashes in embedded systems are unacceptable.

Gijs

@tlanier said in New GPY - LED Faulty:

@Gijs Here's a couple of things that have come up while reading the docs.

Shouldn't the function signature be:
pycom.pybytes_on_boot([boolean])

This function call does not exist in OS version 1.20.2.r4.

Thanks, I'll have that corrected. The correct function call should be pycom.lte_modem_en_on_boot([boolean])

I'm looking to reproduce the error you're seeing, but I'm not quite sure what is actually happening. We get an error here, as you've shown before:

    // Check if the telnet service has been enabled
    if (telnet_data.enabled) {  // <<--- error reported here
        telnet_data.state = E_TELNET_STE_START;
    }

Which almost makes it seem like there's a bad value for telnet_data.enabled, though we get 'Illegal Instruction'. Could you perhaps attach a minimal reproduction? (You can send me a PM if you don't wish to share it publicly.) I'm also wondering: does it happen on all your devices or just a specific one?

Let me know

Gijs

tlanier

@Gijs We have now had a third crash at the same Backtrace address even with the server disabled. The init_pycom() function is called at the beginning of our program.

The crash below happened less than a day after adding the server.deinit() call to the program. Apparently the telnet_wait_for_enabled() gets called even when the server is disabled. Any ideas of what we can try next?

tlanier

@Gijs Here's a couple of things that have come up while reading the docs.

Shouldn't the function signature be:
pycom.pybytes_on_boot([boolean])

This function call does not exist in OS version 1.20.2.r4.

tlanier

@Gijs Thanks for the info on how to disable Telnet and FTP. I will begin testing immediately with both servers disabled.

Last night I had a third crash which was not like the previous two. Unfortunately in this case the device locked up and did not finish displaying the core dump and backtrace address.

[Screenshot: Capture3-Frozen.PNG]

I also have a second test device running on a Pymakr board with a stripped-down version of my program (i2c routines and some other stuff removed). So far that test has run for two days without crashing. Maybe this could lead to a sample reference program that others could study. One thing I'm looking for is a public test web site that the demo program can do simple GETs/POSTs to.

Gijs

Hi,
Thanks for putting in the work and decoding the issue!
I must say I've never seen issues with the Telnet module before, so this seems odd. It is possible to disable it using the server module, though:
https://docs.pycom.io/firmwareapi/pycom/network/server/

I did a short inspection of the code at the point of the crash, and it's waiting for Telnet to be enabled.

In any case, let me know if that improves your stability!
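A minimal sketch of that call, assuming the network.Server class and its deinit() method from the Pycom firmware documentation linked above. The disable_servers name and the import guard are only for illustration, so the same file also loads on a PC.

```python
def disable_servers():
    """Stop the built-in Telnet/FTP server; return True if the call was made."""
    try:
        # Assumption: network.Server only exists on Pycom firmware.
        from network import Server
    except ImportError:
        return False  # not running on a Pycom board
    Server().deinit()
    return True
```

Calling disable_servers() once near the start of main.py would be the natural place for it.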

tlanier

@Gijs I have switched to the latest OS (1.20.2.r4) and modem firmware (48829) as suggested. I have run the application on our board for about 4 days, and I have now been able to capture 2 crashes at the same "Backtrace" address that occurred days apart.

I've run the xtensa-esp32-elf-addr2line program and identified the location where the crash occurs (see below).

telnet_wait_for_enabled()

telnet.c line 306

serverstack.c line 103
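For anyone repeating this, the addresses for xtensa-esp32-elf-addr2line can be collected from the panic output with a short Python script on the PC. This is a sketch under assumptions: the function names and the default ELF file name are hypothetical, and the ELF file must come from the same firmware build that crashed.

```python
import re

# Matches each "PC:SP" pair on an ESP32 panic "Backtrace:" line.
PAIR = re.compile(r"(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)")


def backtrace_pcs(log_text):
    """Return the program-counter addresses from the first Backtrace line."""
    for line in log_text.splitlines():
        if line.strip().startswith("Backtrace:"):
            return [pc for pc, _sp in PAIR.findall(line)]
    return []


def addr2line_command(pcs, elf_path="application.elf"):
    """Build the argument list for xtensa-esp32-elf-addr2line.

    -p prints one location per line, -f adds function names, -i expands
    inlined calls and -e names the ELF file (its path is an assumption).
    """
    return ["xtensa-esp32-elf-addr2line", "-pfi", "-e", elf_path] + list(pcs)
```

The returned list can be passed to subprocess.run() or joined with spaces and pasted into a terminal.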

Is there anyone familiar with why the program would be executing code in a telnet module? I do not run or need a telnet program running in my app. Can it be disabled?

Any help would be appreciated.

tlanier

@tlanier After reflowing the solder joints on the LED, the LED now works properly.

tlanier

@Gijs Thanks for your response. Today I will go back to testing the OS and modem firmware you recommend and try to provide more detail of the problem.

We have previously tried hardwiring a PC +5V power supply to our design to try to eliminate possible power supply issues. The system would still crash in this test case. We have also tried to accelerate the problem by rapidly powering on/off a drill motor near the unit with the hope of causing the problem. This was also unsuccessful.

The coredumps are different from crash to crash.

We do use the _thread module.

import _thread  # com, cycle and log() come from the rest of the application

#####################################################################
#                          Initialize Tasks
#####################################################################
def init_tasks():
    n = 8192    # thread stack size; 4096 is marginal
    _thread.stack_size(n)
    log("initializing tasks with stack size: {}".format(n))

    _thread.start_new_thread(com.com_task, ())
    _thread.start_new_thread(cycle.cycle_task, ())

We do not use pybytes.

Our current version does not use the LTE() module.

I will look at creating a skeleton program that experiences the same type of crashes that I could publish.

Gijs

Hi,
I'd love to help you further, but unfortunately we're not going back to update the older version, and in most scenarios (in an ideal world) the latest firmware version contains the fewest issues. Could you let me know what issues you see in the latest stable version (1.20.2.r4)? The LTE modem firmware also has an update available at https://docs.pycom.io/updatefirmware/ltemodem/#note-on-updating-to-cat-m1-52-48829, and I'd ask you to use that in your testing. It has proven to be more stable than the version you're currently using. The firmware has also had some updates to improve communication with the modem, so I really urge you to use the latest firmware.

There are several methods to decode the crash, one of which is described here: https://docs.pycom.io/advance/coredump/. I recommend first updating to the latest versions and then decoding it, though, so that we have an actual chance of fixing it if it is firmware related (it could also be related to memory allocation issues, MicroPython code or power supply ripple). Most of the time, when you have a stable coredump reproduction (i.e., the backtrace always points to the same lines of code), we can review the cause. When it varies all over the place, it is generally power supply related.

Perhaps you could share some of your code so we can see exactly what is going on. Does it always fail at the same point or at different lines? Are you using _thread or pybytes? All these things could point to a possible solution.

Let me know!

Gijs

tlanier

@Gijs A more serious problem is that the software crashes on some of the chips. Some chips will run for weeks without restarting. Others have random crashes like the one below. We have tried many different versions of the OS and modem firmware trying to isolate the problem. The one version that we have had better luck with is an older version:

Pycom MicroPython V1.20.0.rc13
Modem firmware UE5.1.0.0f, LR5.1.1.0-41065

Even that version will crash on some of the chips.

** check for modem ready
check signal strength
  modem command: AT+CSQ
    modem response: ['+CSQ: 25,99', 'OK']
signal strength = 25
check registration
  modGuru Meditation Error: Core  1 panic'ed (StoreProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x4010e415  PS      : 0x00060a30  A0      : 0x00006325  A1      : 0x3ffddec0
A2      : 0xe80ec000  A3      : 0x00000000  A4      : 0x3f40600c  A5      : 0x00000001
A6      : 0x00ff0000  A7      : 0xff000000  A8      : 0x40d0e894  A9      : 0x3ffddeb0
A10     : 0x00000001  A11     : 0x00d309a4  A12     : 0x3ffd5a38  A13     : 0x00000001
A14     : 0x000000fe  A15     : 0x00060023  SAR     : 0x00000000  EXCCAUSE: 0x0000001d
EXCVADDR: 0x00000011  LBEG    : 0x4009c401  LEND    : 0x4009c417  LCOUNT  : 0xfffffffb

Backtrace: 0x4010e415:0x3ffddec0 0x00006322:0x3ffddef0

================= CORE DUMP START =================

I would be interested in anyone giving ideas on how to isolate and fix such random crashes. The same program and OS and modem firmware will run on some chips for days without crashing. On other chips the problem happens within minutes/hours. The reason for the crash varies. Sometimes for instance it reports an illegal instruction error. On newer versions of the OS and modem firmware the problems happen much more frequently.

Has anyone tried writing some type of memory diagnostic for the GPy to stress test the system?

Would it help if I purchased an ESP-PROG debugger and tried to learn how to compile and debug the OS?

Has anyone tried developing C/C++ code using the Espressif SDK on the GPy? What has been your experience?

Since our code talks directly to the modem and we do not use the LTE() module, we have a better chance of converting our code to C++. The SDK serial and i2c routines look like they would work fine. It looks like it would take special code to control the LED. We also use the file system to store JSON files and this might require some work in C++.

I would much rather try to make Pycom's MicroPython reliable if that is an option.

Troubleshooting ideas are welcome. I'm just looking for a solution.

tlanier

Yes, we do disable the heartbeat. The program works fine on many units. We use different colors to indicate states in the program (transmitting, connecting, sampling, ...). We have 3 units that have come in bad. My technician has actually swapped the LED on the GPy to fix 2 of them. Today he had a new part that was bad and he now believes that by putting pressure on the chip he can make it work. Tomorrow I will write a test program to cycle thru the colors so he can verify. He may be able to reflow the solder joints and not have to swap the LED chip.
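A color-cycle test of the kind mentioned above might look like the following sketch. The color list, function name and parameters are illustrative choices; the LED setter is passed in as a callable so the same code can be tried on a PC.

```python
import time

# Full-brightness test colors as 0xRRGGBB values. The single channels come
# first so a dead red, green or blue element is easy to spot.
TEST_COLORS = [
    ("red", 0xFF0000),
    ("green", 0x00FF00),
    ("blue", 0x0000FF),
    ("yellow", 0xFFFF00),
    ("cyan", 0x00FFFF),
    ("magenta", 0xFF00FF),
    ("white", 0xFFFFFF),
]


def cycle_colors(set_color, hold_s=1.0, sleep=time.sleep, report=print):
    """Show each test color in turn, then switch the LED off."""
    for name, value in TEST_COLORS:
        report("LED should be " + name)
        set_color(value)
        sleep(hold_s)
    set_color(0x000000)
```

On the GPy this would be run as cycle_colors(pycom.rgbled) after calling pycom.heartbeat(False), while the technician compares the printed color name with what the LED shows.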

Gijs

Could you elaborate on 'some colors would not work'?
Know that you have to disable the heartbeat before you can change the color, like so:

import pycom
import time
pycom.heartbeat(False)
pycom.rgbled(0xAA0000) #red
time.sleep(1)
pycom.rgbled(0x00aa00) #green
time.sleep(1)
pycom.rgbled(0x0000aa) #blue
