Watchdog failed? Too many Timer.Alarm and Timer.Chrono?

thibault

Hello all, I'm facing a big problem with the watchdog.

A LoPy had been running for 14 hours when it got stuck in a restart loop triggered by the watchdog timer. The watchdog is set to 10 s, which gives me time to try some code to understand what is happening.

My watchdog:

from gc import collect
from machine import WDT, Timer

class Watchdog:

    def __init__(self):
        self.wdt_feeder_period = 3 # seconds
        self.wdt_timeout = 10 # seconds
        self.wdt = WDT(timeout=(self.wdt_timeout*1000))  # timeout in milliseconds
        self.__feeder = Timer.Alarm(self.feed_the_dog, self.wdt_feeder_period, periodic=True)

    def feed_the_dog(self, alarm):
        self.wdt.feed()
        collect()

watchdog = Watchdog()

When the device reboots, I have a few seconds to send it code (with the CTRL+B shortcut) like this:

for _ in range(100):
    watchdog.wdt.feed()
    sleep(0.2)

This one keeps the device available. But this one does not:

class Clock:

    def __init__(self):
        self.seconds = 0
        self.__alarm = Timer.Alarm(self._seconds_handler, 1, periodic=True)

    def _seconds_handler(self, alarm):
        self.seconds += 1
        watchdog.wdt.feed()

clock = Clock()

It's as if Timer.Alarm doesn't want to run(?). I tried using a thread to feed the (watch)dog and that works fine. But I don't know why the watchdog (and Timer.Alarm) won't run. After many restarts, the device seems stable and no longer restarts. But it's clear that my code is not stable. And I don't know whether this behaviour is because I use Timer.Alarm permanently in two modules, and two Timer.Chrono sometimes.

While waiting for any idea that explains this behaviour, I will try to replace Timer.Alarm with a thread. But I would be glad if anybody finds the reason (and maybe has a solution) for this silly watchdog. Thanks.
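
For reference, a minimal sketch of the thread-based feeder described above. `FakeWDT` and `ThreadFeeder` are hypothetical names; `FakeWDT` only stands in for `machine.WDT` so the sketch can run off-device. `_thread.start_new_thread` and `time.sleep` are used because both exist in MicroPython and in desktop Python:

```python
import _thread
import time


class FakeWDT:
    """Hypothetical stand-in for machine.WDT so the sketch runs off-device."""

    def __init__(self, timeout):
        self.timeout = timeout  # milliseconds, as with machine.WDT
        self.feeds = 0

    def feed(self):
        self.feeds += 1


class ThreadFeeder:
    """Feeds a watchdog from a background thread instead of a Timer.Alarm."""

    def __init__(self, wdt, period_s):
        self.wdt = wdt
        self.period_s = period_s
        self.running = False

    def _loop(self):
        # Runs in the background thread until stop() clears the flag.
        while self.running:
            self.wdt.feed()
            time.sleep(self.period_s)

    def start(self):
        self.running = True
        _thread.start_new_thread(self._loop, ())

    def stop(self):
        self.running = False
```

On the device one would pass a real `WDT(timeout=10000)` instead of `FakeWDT`. Note that a feeder like this keeps feeding even if the rest of the application hangs.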

g0hww

I thought that I would mention another trick that I use, when using the REPL and code is running. The content of main runs in a try block and I have catch blocks like so:

    except KeyboardInterrupt as e:
        # the wdt will give us 30 mins to play in the repl ...
        wdt = WDT(timeout=30*60*1000)
        print('CAUTION: Watchdog will reboot in 30 minutes!')
        print('Node stopped')
    except Exception as e:
        sys.print_exception(e)
        # game over man, expect reboot soon!
        wdt = WDT(timeout=5000)
        pycom.heartbeat(False)
        utime.sleep_ms(100)
        pycom.rgbled(0x7f0000) # red
        print('FATAL: Watchdog will reboot soon!')

Catching the keyboard interrupt means that I get 30 mins to play in the REPL console before the thing will reboot, in case I wander off and forget to restart it.

thibault

I will have the result in a few days (if our device doesn't restart), but it seems that our reboot problem was caused by a misuse of Timer.Alarm: creating alarms from within an interrupt. Since we made some modifications around this, the device seems more stable.
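
The post does not show the actual change. One common way to avoid creating alarms inside an interrupt handler is to set a flag in the handler and create the alarm later from the main loop. A minimal sketch of that idea, with every name hypothetical and the alarm factory injected so it runs off-device (on a LoPy, `make_alarm` could wrap a `Timer.Alarm(...)` call):

```python
class DeferredAlarmRequest:
    """Lets an interrupt handler request an alarm without creating it there."""

    def __init__(self, make_alarm):
        # make_alarm: callable that creates and returns the alarm object.
        self.make_alarm = make_alarm
        self.requested = False
        self.alarm = None

    def irq_handler(self, arg=None):
        # Keep the handler minimal: only record that an alarm is wanted.
        self.requested = True

    def service(self):
        """Call from the main loop; creates the alarm outside interrupt context."""
        if not self.requested:
            return False
        self.requested = False
        if self.alarm is None:
            self.alarm = self.make_alarm()
            return True
        return False
```

The main loop would call `service()` on each pass, so at most one alarm is created no matter how often the interrupt fires.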

thibault

@g0hww OK, it's clear to me now. Thanks!

g0hww

@thibault

I set self.cbit_timer = None so I can check to see if it is None in the destructor for the Node object and cancel it in the destructor if it isn't None. I prefer to call cancel() on it first, though that may be excessive.

thibault

@g0hww thanks for your code example and for sharing your experience. I see that you use

                self.cbit_timer.cancel()
                self.cbit_timer = None

to disable the timer. In your opinion, is this necessary when the periodic argument was set to False at alarm creation? Is it to be sure that the alarm is deleted?

In our case, the device can stay unused for a long time (several hours), so there is no main function running while it does nothing. So I can't use the same idea as you, and for now I have used a thread to solve (I hope) this problem. But I'm curious to understand whether or not I can use the alarm object. For alarm code I write later, though, I will use your solution for creating/deleting alarms!

g0hww

After concluding that my codebase needed a fair bit of refactoring to make it stable, and it has been stable for a long time now (though I haven't tried any FW updates for about 1 year), I ended up with the following approach.

My class Node provides the following 2 methods of relevance

    def cbit_cb(self, void):
        try:
            if self.cbit_timer is not None:
                self.cbit_timer.cancel()
                self.cbit_timer = None
            self.cbit_active = False
            self.wdt.feed()
            #print('Node.cbit_cb() complete')
        except Exception as e:
            print('Node.cbit_cb() - failed!')
            sys.print_exception(e)

    def cbit(self):
        with self.cbit_lock :
            try:
                if self.running and not self.cbit_active:
                    self.cbit_timer = Timer.Alarm(self.cbit_cb, s=1.0, periodic=False)
                    self.cbit_active = True
                    #print('Node.cbit_() starting cbit check')
            except Exception as e:
                print('Node.cbit() - failed to set cbit timer!')
                sys.print_exception(e)

and my main() has this ...

                with Navigator(pyt, node, 10) as nav:
                    while node.is_running() :
                        #print('Doing work')
                        if nav.work():
                            node.cbit()
                            gc.collect()
                            node.try_uptime_status()
                            node.try_memory_status()
                            node.try_battery_status(pyt)

The intent with this code is to not rely on a repeating timer callback to kick the dog. Instead, when the main thread succeeds in performing its primary task, it calls node.cbit() which sets up a one shot timer callback which actually kicks the dog. This ensures that the primary work is being done successfully and avoids the problematic scenario in which a repeating timer callback keeps kicking the dog when everything else has turned to poo.
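
The same idea as a self-contained sketch that runs on desktop Python. This is not the Node class above: `FakeWDT` and `WorkGatedFeeder` are hypothetical names, and `threading.Timer` is only a stand-in for a one-shot `Timer.Alarm`:

```python
import threading
import time


class FakeWDT:
    """Hypothetical stand-in for machine.WDT so the sketch runs off-device."""

    def __init__(self):
        self.feeds = 0

    def feed(self):
        self.feeds += 1


class WorkGatedFeeder:
    """Feeds the watchdog via a one-shot timer armed only by the main loop."""

    def __init__(self, wdt, delay_s=1.0):
        self.wdt = wdt
        self.delay_s = delay_s
        self.lock = threading.Lock()
        self.timer = None
        self.active = False

    def _on_timer(self):
        # One-shot callback: clear state so the next arm() can succeed, then feed.
        with self.lock:
            self.timer = None
            self.active = False
        self.wdt.feed()

    def arm(self):
        """Call after a successful unit of work; no-op if a feed is pending."""
        with self.lock:
            if self.active:
                return False
            self.timer = threading.Timer(self.delay_s, self._on_timer)
            self.timer.daemon = True
            self.active = True
            self.timer.start()
            return True

    def cancel(self):
        """Drop any pending feed, e.g. when shutting down."""
        with self.lock:
            if self.timer is not None:
                self.timer.cancel()
                self.timer = None
            self.active = False
```

If the main loop stops calling `arm()`, no further feeds are scheduled and the watchdog eventually expires, which is the point of the design.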

That code has been running on a LoPy4 GPS tracker (which sends its position via LoRa) for 23 days. I'm not sure why its uptime isn't longer as it is stable enough that I don't pay attention to it any more. I have another variant of that code running on another pair of LoPy4s that act as LoRa to MQTT gateways. They tend to get watchdog resets when their MQTT publishing fails for more than 5 minutes, using the same mechanism shown above. This is normally because the wifi goes down when the router is rebooted and those gateway devices have nothing better to do than reboot under such circumstances.

Of course there is more code to handle deletion of the node object and the cancellation of any active timers, etc.

Hope this helps.
