URGENT ! Pycom firmware/hardware freezes after some hours.



  • We have been rolling Pycom GPy chips into production. Our GPy code is supposed to send data to our cloud every 3 seconds, but we observed that the GPy stops sending data to the cloud after 12-15 hours. We went to the location and checked the chip: the blue light keeps blinking for days, but we do not receive any data on the cloud. Can someone tell us whether Pycom is designed for continuous monitoring? If yes, then why does this issue occur? Please help! I am attaching my code below:

    def main():

        import socket
        import sys
        import machine
        import ssl
        import time
    
        from network import LTE
        from machine import Pin
        from machine import WDT
    
        count = 1
        lte = LTE()     # instantiate the LTE object
        #lte = LTE(carrier="verizon")
        lte.attach()
        #lte.attach(band=13)
        print('test')      # attach the cellular modem to a base station
        while not lte.isattached():
    
            time.sleep(0.25)
            print('not attached')
        lte.connect()       # start a data session and obtain an IP address
        Data = 'string-'
        while not lte.isconnected():
    
            time.sleep(0.25)
            print ('not isconnected')
    
        while lte.isconnected():
                print('connected')
    
                #i=1
    
                    #avg = val+val
                #avg2 = avg/10
                #final_ppm = avg2/0.792
                #print(final_ppm)
                adc = machine.ADC()             # create an ADC object///// for pressure sensor
                apin = adc.channel(pin='P16')   # create an analog pin on P16
                val = apin()
    
                value = (val/35)
                #ppm=0
    
                #w=1
                #for w in range(0,1000):
                    #import machine
                    #adc = machine.ADC()             # create an ADC object
                    #apin = adc.channel(pin='P15')   # create an analog pin on P16
                    #val = apin()
                    #ppm = val + ppm
                    #print (val)
    
    
                #avg2 = ppm/1000
                #final_ppm = (avg2/3.7)
                #print(final_ppm)
                #Data += final_ppm
                #Data = 'string'
    
                #Data += str(w) + '-' + str(final_ppm)
    
    
                #print (Data)
    
    
    
    
    
                #Data += 'final_ppm'
    
    
                #x= (value)
                x = 105
                #print(value)
                #print(Data)
    
    
    
    
    
    
    
                #my_dev = "https://api.thingspeak.com/update?api_key=I3VOLTPRDMJPO05W&field1={val}\r\n\r\n"
                #let url = url (string)
                ##my_dev = "//api.thingspeak.com/update?api_key=I3VOLTPRDMJPO05W&field1={}".format(value)
                #print (my_dev)
                #link = "https:"+my_dev
                #print (link)
                #print ("GET" +link)
    
    
    
    
    
    
                #print (val)
    
                try:
                    s = socket.socket()
                    print('socket connected')
                except socket.error as msg:
                    print('socket.socket not conected')
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                except (OSError):
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                except Exception as err:
                    print('socket.wrap not conected')
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                try:
    
                    s = ssl.wrap_socket(s)
                    print('ssl. wrap connected')
                except socket.error as msg:
                    print('socket.socket not conected')
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                except Exception as err:
                    print('socket.wrap not conected')
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                except (OSError):
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                except ssl.SSLError as e:
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return    #------paste iotsocket
    
                try:
                    #s.connect(socket.getaddrinfo('api.thingspeak.com', 443)[0][-1])
                    #s.connect(socket.getaddrinfo('iotstagingapi.geoviewer.io',  443)[0][-1])-------Development
                    s.connect(socket.getaddrinfo('iotapi.geoviewer.io',  443)[0][-1])
                    print(' connect to iot socket')
                except socket.error as msg:
                    print('socket.socket iot not conected')
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                except Exception as err:
                    print('socket not conected')
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                except (OSError):
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                #s.connect(socket.getaddrinfo('34.206.17.110', 443)[0][-1])
                #s.send("GET" +link)
    
                        #https://iotapi.geoviewer.io/v1/adddevice/rainbow/bb4029def3451783f4a558004bbc696h/0/0
                #s.send(b"GET https://iotapi.geoviewer.io/v1/adddevice/rainbow/PRE029def3451783f4a667004bbc696h/0/0\r\n\r\n")
                #print(s.recv(4096))
                #try:
                    #s.send(b"GET https://iotapi.geoviewer.io/v1/adddevice/rowd/ff1d810c-116a-4c0f-b973-4fd032ef268c/33.9667/-117.9187741")
                    #s.send(b"GET https://iotapi.geoviewer.io/v1/adddevice/rowd/ff1d810c-116a-4c0f-b973-4fd032ef268c/33.9667/-117.9187741\r\n\r\n")
                    ##s.send(b"GET https://iotapi.geoviewer.io/v1/adddevice/rowd/bb4029def3451783f4a558004bbc696f/0/0\r\n\r\n")
                    #print(s.recv(4096))
                    #raise Exception
                #except Exception as err:
                    #print('error caught')
                    #lte.dettach()
                    #lte.disconnect()
                    #lte.reset()
                    #s.close()
                    #return
                try:
                    #s.send(b"GET https://iotapi.geoviewer.io/v1/livedata/rowd/ff1d810c-116a-4c0f-b973-4fd032ef268c/{}/50\r\n\r\n".format(value))
                    #s.send(b"GET https://iotapi.geoviewer.io/v1/livedata/rowd/ff1d810c-116a-4c0f-b973-4fd032ef268c/75/50")
                    #s.send(b"GET https://iotapi.geoviewer.io/v1/livedata/rowd/ff1d810c-116a-4c0f-b973-4fd032ef268c/75/50\r\n\r\n")
                    #s.send(b"GET https://iotapi.geoviewer.io/v1/livedata/rowd/bb4029def3451783f4a558004bbc696f/75/50\r\n\r\n")
                    #print(s.recv(4096))
                    #x = (final_ppm)
    
                    #x =final_ppm
                    #s.send(b"GET https://iotapi.geoviewer.io/v1/adddevice/rowd/bb4029def3451783f4a558004bbc696f/0/0\r\n\r\n")
                    url = 'https://iotapi.geoviewer.io/v1/livedata/rowd/bb4029def3451783f4a558004bbc696f/'+str(x)+'/50\r\n\r\n'
    
    
                    #url = 'https://iotapi.geoviewer.io/v1/livedata/rainbow/TDS029def3451783f4a558004bbc696h/'+str(x)+'/50\r\n\r\n'#-----rainbow tds
                    #url = 'https://iotapi.geoviewer.io/v1/livedata/mswd/cc4029def3451783f4a558004bbc765f/'+str(x)+'/50\r\n\r\n'#------mswd
                        #url = 'https://iotapi.geoviewer.io/v1/livedata/rainbow/PRE029def3451783f4a667004bbc696h/'+str(x)+'/50\r\n\r\n'
    
    
                    #url = 'https://iotstagingapi.geoviewer.io/v1/livedata/mswd/we3029def3451783f4a558004bbc765f/'+str(x)+'/50\r\n\r\n'#----- developmenttesting
                    s.send(b"GET "+url)
                    #print("GET "+url)
                    #print('link  not connected')
                    #print(s.recv(4096)) #USE IT LATER
    
                    #print ('link connected')
                    #raise Exception
                except Exception as err:
                    print('error caught2')
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                except socket.error as msg:
                    print('socket.socket not conected')
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    machine.deepsleep(40)
                    return
                except (OSError):
                    lte.dettach()
                    lte.disconnect()
                    lte.reset()
                    s.close()
                    return
    
                #s.send(b"GET link")
                #print(s.recv(4096))
                count = count + 1
    
                s.close()
                print (count)
    
                if count == 1000 :
    
                    lte.dettach()
    
                    lte.disconnect()
                    #lte.reset()
    
                    s.close()
                    machine.deepsleep(4000)
    
                    return
    

    from network import WLAN
    from network import Bluetooth
    import machine

    bluetooth = Bluetooth()
    bluetooth.deinit()
    wlan = WLAN(mode=WLAN.STA)

    #Data = 'string-'
    i=1

    from network import LTE
    import socket
    import machine

    print('going to sleep')
    #machine.deepsleep(400000)
    #machine.pin_deepsleep_wakeup(['P13']['P2'], mode=machine.WAKEUP_ANY_HIGH, enable_pull=True)

    #s = socket.socket()
    #lte = LTE()
    while (i==1):
    #while (i == 1):
        print('restart')
        try:
            main()
        except Exception as err:
            print('error caught1')
            machine.deepsleep(40)
            main()
        except socket.error as msg:
            lte.dettach()
            lte.disconnect()
            lte.reset()
            s.close()
            main()
        except (OSError):
            lte.dettach()
            lte.disconnect()
            lte.reset()
            s.close()
            main()
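
    One detail in the code above that may be worth ruling out, independent of the freeze itself: the same cleanup sequence is repeated in every handler, and the handlers around socket.socket() call s.close() even though s was never assigned at that point. Below is a minimal sketch of doing the cleanup in one place. The helper names (safe_call, shutdown, send_reading) and the make_socket factory are hypothetical, and the disconnect-before-dettach order is an assumption rather than something confirmed in this thread.

```python
def safe_call(fn):
    """Run one cleanup step; report a failure instead of raising it."""
    try:
        fn()
    except Exception as err:
        print('cleanup step failed:', err)


def shutdown(lte, sock=None):
    """Close the socket if one was created, then tear down the LTE session."""
    if sock is not None:
        safe_call(sock.close)
    safe_call(lte.disconnect)  # assumed order: end the data session first
    safe_call(lte.dettach)     # spelled 'dettach', as in the code above
    safe_call(lte.reset)


def send_reading(lte, make_socket, request):
    """Send one request. Returns True on success, False after cleaning up."""
    sock = None
    try:
        sock = make_socket()   # hypothetical factory: create, wrap, connect
        sock.send(request)
        sock.close()
        return True
    except Exception as err:
        print('send failed:', err)
        shutdown(lte, sock)
        return False
```

    The caller can then decide in a single place what to do when send_reading returns False (sleep, reset, retry), instead of repeating that decision in each except branch.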


  • Dear @tanmay07 and @Martinnn,

    if you feel lucky, you might want to try our custom build just released at [1a] (GPy) and [1b] (FiPy). More background about this is available through [2].

    Please be aware that when upgrading from a lower firmware version to this custom build based on 1.20.1.r1, you will have to erase your device completely before flashing in order to keep things straight. You will find respective references to this on the forum. Hint: Use pycom-fwtool-cli --port /dev/ttyUSB0 erase_all, see also [3].

    With kind regards,
    Andreas.

    [1a] https://packages.hiveeyes.org/hiveeyes/foss/pycom/vanilla/GPy-1.20.1.r1-0.6.0-vanilla-dragonfly.tar.gz
    [1b] https://packages.hiveeyes.org/hiveeyes/foss/pycom/vanilla/FiPy-1.20.1.r1-0.6.0-vanilla-dragonfly.tar.gz
    [2] https://community.hiveeyes.org/t/investigating-random-core-panics-on-pycom-esp32-devices/2480
    [3] https://community.hiveeyes.org/t/installing-the-recent-pycom-firmware-1-20-1-r1-requires-erasing-the-flash-memory-completely/2688



  • Hi there,

    the ESP32 rev.1 and rev.2 apparently have a hardware bug, which we outlined in [1]. Running in dual-core mode probably makes the respective cache memory coherency issues more likely to trigger.

    So, Pycom might not be the one to blame here. While it took some time and people obviously got very impatient about this, Espressif Systems provides a workaround on the compiler level now which seems to work well for others out there.

    While we haven't thoroughly tested that yet, we uploaded some vanilla firmware images to [2] just yesterday. It's the *Py-1.20.1.r1-0.2.0-vanilla-psram-fix.tar.gz files I am talking about.

    With kind regards,
    Andreas.

    [1] https://community.hiveeyes.org/t/random-memory-corruption-faults-on-esp32-wrover-rev-1-and-rev-2-when-running-in-dual-core-mode/2515
    [2] https://packages.hiveeyes.org/hiveeyes/foss/pycom/vanilla/



  • @protean

    As far as I have run into it, there are two main reasons for unstable systems, and two more that depend on the type of project.

    • Python: You have to test all code paths, otherwise you will run into typical Python problems that only show up at runtime.
    • I2C problems with the Pycom extension boards. You have to run a lot of long-duration tests, otherwise you may run into I2C exceptions. Maybe the deepsleep problem is related to this. Sometimes deepsleep does not go to sleep on my Pytrack, so the device never wakes up and never starts your code again. But since 1.20.rc4 I have not run into this problem.
    • Memory fragmentation may be a problem on larger projects. RAM is limited, especially on the old devices, so you cannot code in the same way as on your PC. Keep in mind that the compiler and the string and float methods fragment your memory a lot. You can get around the compiler issue by using a cross compiler on the PC. For the rest, you have to keep an eye on the system: if it needs huge buffers, allocate them at an early stage and reuse them instead of allocating new ones. Avoid floats and strings. Don't import in the middle of runtime; import the modules you need up front, the same way you allocate your huge buffers.
    • Some developers don't see the limits of the radio systems. There seems to be no feedback when they are too noisy and the radio submodule forces them to be silent.

    Sadly there is no option to log errors onto the SD card, so it's hard to see that most problems are related to typos, a forgotten self., and memory fragmentation. I have seen a project that started to build a GC able to reorganize the allocated objects. It would be great to have these two features as an option in MicroPython.
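
    Short of a built-in facility, application code can still catch exceptions itself and append the traceback to a file. This is a minimal sketch, assuming a writable path is available (mounting an SD card is not shown here). log_exception and run_logged are hypothetical names. The CPython branch is there only so the sketch can be tried on a PC.

```python
import sys


def log_exception(exc, path):
    """Append the traceback of `exc` to the text file at `path`."""
    with open(path, 'a') as log:
        if hasattr(sys, 'print_exception'):
            # MicroPython: print the traceback to a file-like object.
            sys.print_exception(exc, log)
        else:
            # CPython fallback, so the sketch can be tried on a PC.
            import traceback
            traceback.print_exception(type(exc), exc, exc.__traceback__, file=log)


def run_logged(task, path):
    """Run task(); on failure log the traceback and return False."""
    try:
        task()
        return True
    except Exception as exc:
        log_exception(exc, path)
        return False
```

    A log like this only captures Python-level exceptions; it would not record a hard freeze or a core panic.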



  • @protean Not directly related, I am sending data via wifi/MQTT to AWS using the builtin library. Devices freeze irregularly with guru meditation error, regardless of the watchdog being used. We'll add an external hardware watchdog and hope for the best.



  • @timh @jsh I know this is an old thread, but we're using Pycom devices for a slightly similar task (connecting to fuel systems for remote monitoring). Are either of you still using these devices? Are you using the Pybytes firmware, or vanilla these days? I'd love to pick your brain, if you have the time.



  • @jsh in my case I don't need to worry about power. I am also more interested that the device is on and monitoring.

    The devices are on diesel pumps, and they need to stop and start based on water level, so the Pycom power draw is a minor component. The 12V system needs to keep things powered on, so solar is present to keep everything charged.

    But you're right: the WDT should be only a last resort, not a requirement for getting a system to run reliably at all (though a necessary last resort).



  • @timh

    Thanks for the information. With the watchdog in place, my project is able to recover from failure. I hate that it fails, and I hate hacking together code like that, but it does work. However, with deinit not working properly, isn't deep sleep useless? Which, again, makes this entire device useless for its intended purpose.



  • I want to reiterate @paulm's thoughts here: Right now, this product is unusable for its exact intended purpose. We need someone from Pycom to tell us how to make this thing work reliably to stay connected to LTE and send small amounts of data periodically.

    Speaking for everyone, that's not too much to ask. Speaking for myself, if it happens, I am prepared to purchase a lot of these very soon. Until then, I'm forced to go another route for my clients.

    I want to like this thing as much as I thought I was going to. Please make that happen for us.

    @paulm said in URGENT ! Pycom firmware/hardware freezes after some hours.:

    Can someone from the Pycom team please take a very careful look at this and make public a guide about how to actually use GPy/FiPy for its intended purpose: reliable 24/7, long-duration, in-the-field monitoring/uploading over LTE?

    I am having the exact same problem as the OP. I have been experimenting for days with various LTE strategies, connecting once and keeping open, connecting/disconnecting/detaching for every upload, etc. I cannot get this thing to upload for more than minutes or hours when I put it on a USB charger battery.

    I was intending to experiment more and then make a detailed/technical thread, but before that, is there any general information about these problems? (LTE connection works for some time, then the modem becomes unusable, and I get "the requested operation failed" for any further LTE operations)

    I have invested in 5 FiPy/GPys for a long-term, reliable, 24/7 meteorological station transmitting from a 3-hour drive away from my house. I can't afford to have these die in the field and need manual rebooting. They need to work and upload constantly.

    I love the idea of Pycom's products and really, really want this product to succeed. But right now it is very frustrating we can't even seem to use it for its most basic purpose.

    Is there anyone out there who has a FiPy/GPy successfully uploading 24/7 over LTE with no inexplicable interruptions? Can we see your code?



  • @timh Thanks for sharing your experiences, Tim! So you would keep the LTE perpetually "attached", and just call connect/disconnect around your data sessions?

    When did you use LTE.attach() and/or detach()?

    I have found the same with deinit() and also have decided to avoid it entirely.

    The implementation of the WDT is very helpful to see and definitely a must; my strategy is to get this working as well as possible without the WDT first, then add it only as a last-hope failsafe for in-the-field deployment.

    Thanks again.



  • @paulm Hi Paul

    Sorry for the various interchangeable terms.

    As to watchdog.

    class FakeWDT:
        def feed(self):
            pass
    

    Then, just before I start the main loop, I do the following.
    I have it set up so that if a particular pin is pulled to ground, the watchdog is not enabled.
    That way you can reboot and re-program/test without the watchdog kicking in.

    if enable_wdt:
        wdt = WDT(id=0, timeout=settings.get("WDT",15000))
        if debug:
            print("Watchdog enabled")
    else:
        wdt = FakeWDT()
        print("Watchdog explicitly disabled!")
    

    Then at various points in the loop (i.e. before enabling the LTE interface and while waiting for the attach) you feed the watchdog.

    wdt.feed()
    

    As to LTE, I found that deinit wasn't very reliable - would often hang.

    So I have this going on. I pass the wdt entity around so other parts of the code can feed it. Sometimes the LTE can take a while to connect (longer than 15 sec), so each time through the loop I keep track of how many times through, then disconnect and reconnect. My connect function always does a reset before trying to reconnect. That seemed to preve

    if reconnect_count > connection_cycle:

        wdt.feed()
        client.disconnect()

        if not lte.isconnected():
            wdt.feed()
            lte.disconnect()
            lte = connect(wdt)
            wdt.feed()
            if not lte.isconnected():
                machine.reset()

        client = connect_mqtt(settings,debug)
        reconnect_count = 0
    
    Each pass through the loop I would also check the MQTT and LTE connection status and reconnect if it had dropped out.
    
    This resulted in a pretty reliable system (no physical resets required), that is until Telstra made a change ;-)
    
    Cheers
    
    T
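
    The connect function described above (always reset first, feed the watchdog while waiting) is not shown in the post. A minimal sketch of how such a function might look follows. It differs from the snippet above in that it takes the lte object as an argument and returns True/False instead of returning lte. The timeout values, the wait_for helper and the injectable sleep are assumptions for illustration.

```python
import time


def wait_for(check, wdt, timeout_s, poll_s, sleep):
    """Poll check() until it is true, feeding the watchdog between polls."""
    waited = 0
    while not check():
        if waited >= timeout_s:
            return False
        wdt.feed()
        sleep(poll_s)
        waited += poll_s
    return True


def connect(lte, wdt, timeout_s=120, poll_s=1, sleep=time.sleep):
    """Reset the modem, then attach and start a data session.

    Returns True once lte.isconnected() reports a session, or False if
    either wait ran out. poll_s must stay well below the watchdog timeout.
    """
    wdt.feed()
    lte.reset()
    wdt.feed()
    lte.attach()
    if not wait_for(lte.isattached, wdt, timeout_s, poll_s, sleep):
        return False
    lte.connect()
    return wait_for(lte.isconnected, wdt, timeout_s, poll_s, sleep)
```

    Bounding both waits means a modem that never attaches ends in a False return (and, in the loop above, a machine.reset()) rather than an endless polling loop that keeps feeding the watchdog.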


  • @timh Thank you for sharing!

    How did you implement the watchdog timer?

    So you had the modem continuously connected, and "refreshed" every 4 hours?

    You threw out the terms "connect" "attach" and "reset" semi-interchangeably; when did you do either of these things? Does "reset modem" mean anything more than calling lte.reset()?

    Did you ever use lte.deinit()? (Separate from lte.detach() and lte.disconnect())

    Can you post your code?

    It is lamentable that there's no public example offered by Pycom of one of their LTE boards continuously uploading over LTE for extended times. You could help fill this glaring omission!

    I have been debugging/experimenting/trying different things for days - which is the only reason I have the boldness to ask to see your code instead of figuring it out myself.

    I am not sure why Pycom doesn't have more info/instructions in this area.



  • @timh said in URGENT ! Pycom firmware/hardware freezes after some hours.:

    @tanmay07 I had my FiPy/GPy on continuously for weeks, until we had a Telstra issue and now no Pycom LTE CAT-M1 devices can connect to Telstra. However, that is a different issue ;-(

    The two main strategies I took (and I wasn't using deepsleep) were:

    1. Explicitly close the connection, reset the modem and then re-attach the LTE connection after a period of time, around 4 hours.
    2. Use the watchdog timer to reset the device in the event of a problem.

    Once I did both of these, I found the device would stay connected reliably over weeks.

    Cheers

    T

    @tanmay07 @PaulM That is it: follow those 2 instructions and you will have a reliable connection; that is how we managed to make it work. The modem seems to hang sometimes, so you need to handle it. You'll get it working unless you are in Australia: our network operator here decided to change something, and we now have a nice set of expensive bricks and haven't heard from Pycom. I hope it does not happen to you! GPys and FiPys are not certified by any operator, so if one day it stops working, you'll know why...



  • @tanmay07 I had my FiPy/GPy on continuously for weeks, until we had a Telstra issue and now no Pycom LTE CAT-M1 devices can connect to Telstra. However, that is a different issue ;-(

    The two main strategies I took (and I wasn't using deepsleep) were:

    1. Explicitly close the connection, reset the modem and then re-attach the LTE connection after a period of time, around 4 hours.
    2. Use the watchdog timer to reset the device in the event of a problem.

    Once I did both of these, I found the device would stay connected reliably over weeks.
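
    The first strategy can be sketched as a small time-based check run on every pass of the main loop. This is only an illustration under stated assumptions: RefreshTimer, service_link and the refresh callable (which would close the connection, reset the modem and re-attach) are hypothetical names, and now is assumed to be a seconds counter such as time.time().

```python
REFRESH_INTERVAL_S = 4 * 60 * 60  # "around 4 hours", per the post above


class RefreshTimer:
    """Tracks when the LTE link was last rebuilt."""

    def __init__(self, interval_s, now):
        self.interval_s = interval_s
        self.last = now

    def due(self, now):
        return now - self.last >= self.interval_s

    def mark(self, now):
        self.last = now


def service_link(timer, now, is_connected, refresh):
    """Rebuild the link if it dropped or the interval elapsed.

    Returns True if refresh() was called on this pass.
    """
    if timer.due(now) or not is_connected():
        refresh()
        timer.mark(now)
        return True
    return False
```

    In the main loop this would be called once per pass, for example service_link(timer, time.time(), lte.isconnected, refresh), with the watchdog from the second strategy fed around it.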

    Cheers

    T



  • Can someone from the Pycom team please take a very careful look at this and make public a guide about how to actually use GPy/FiPy for its intended purpose: reliable 24/7, long-duration, in-the-field monitoring/uploading over LTE?

    I am having the exact same problem as the OP. I have been experimenting for days with various LTE strategies, connecting once and keeping open, connecting/disconnecting/detaching for every upload, etc. I cannot get this thing to upload for more than minutes or hours when I put it on a USB charger battery.

    I was intending to experiment more and then make a detailed/technical thread, but before that, is there any general information about these problems? (LTE connection works for some time, then the modem becomes unusable, and I get "the requested operation failed" for any further LTE operations)

    I have invested in 5 FiPy/GPys for a long-term, reliable, 24/7 meteorological station transmitting from a 3-hour drive away from my house. I can't afford to have these die in the field and need manual rebooting. They need to work and upload constantly.

    I love the idea of Pycom's products and really, really want this product to succeed. But right now it is very frustrating we can't even seem to use it for its most basic purpose.

    Is there anyone out there who has a FiPy/GPy successfully uploading 24/7 over LTE with no inexplicable interruptions? Can we see your code?



  • PS: It only starts transmitting again when we go to the location and hard-reset it by unplugging the power supply and plugging it in again.


