GPY rogue safe mode reboots
Having a problem with GPYs returning from the field with boot.py & main.py both in their default (put your software here) states. We conclude that they have had a safe boot. But why? The GPYs are mounted on a host board of our own design that provides 5V & some screw terminals for inputs, with P12 not connected to anything. Given the GPYs are mounted in a sealed enclosure with no access to the reset button, we are struggling to understand how a safe boot (reset button pressed with P12 connected to 3v3) is occurring. Thoughts or advice gratefully accepted.
- Not seeing evidence of a corrupted file system Robert, all nvs.sets and error.log files are OK. Just boot.py & main.py have been rebuilt to the 'put-your-code here defaults'. Might actually skip the imports & simply have 2 independent copies of the user code at boot.py & main.py
- Just checked the code, we don't actually close files, we use 'with open' which does not require file closure. Is that OK?
- What if I'm wrong & boot.py & main.py have been rebuilt due to file system corruption. Is there a way of dealing with this?
- It tries boot.py. If all is OK, main.py never runs because the user code is non-terminating.
- If boot.py or usercode1.py are missing/corrupted it rebuilds boot.py to the default.
- It then tries main.py which should run the backup user2.py
We always close any open files so that's not the cause of our woes
- Running main.py is the default. So machine.main("main.py") will not change anything, if main.py is corrupted.
- Putting machine.main("user.py") into main.py has no effect, because the file named by machine.main() will be executed INSTEAD of main.py, so machine.main() has to be called before main.py (or whatever is named as its replacement) is run. You can put 'machine.main("user.py")' into boot.py, and 'import user' in main.py. Then you have a double chance.
- Yes. machine.deepsleep() qualifies as 'never terminates', because on wakeup from deepsleep the device reboots. Please take care of @jcaron's hint about closing open files before calling deepsleep().
- Currently we have machine.main('main.py') in our boot.py & the user code in main.py. The thinking was if boot.py failed to run main.py then main.py should be the next thing it tries. Is this thinking correct in terms of getting 2 chances for the user code to run?
- If we had the user code in user.py and machine.main('user.py') in both boot.py & main.py, would that be a better option in terms of getting 2 chances for the user code to run?
- The last line in the user code is a machine.deepsleep, does that qualify as 'never terminates'?
- machine.main('mycode.py') executes mycode.py instead of main.py, while importing mycode.py in main.py executes both.
- Funny enough, the test for main.py is first, then boot.py
- If mycode.py never terminates, then you can do that. Putting it into boot.py would result in main.py never being executed. The actual boot sequence is:
a.1) on Gpy and FiPy: start modem.
b) if not safeboot: boot.py
d) if not safeboot: main.py or whatever is defined by machine.main()
That is related to the most recent master branch, file mptask.c, lines 280-330
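A rough way to picture steps b) and d) and the machine.main() behaviour is the small simulation below. It is plain Python, not firmware code: the function simulate_boot, its parameter names and the returned list are made up for illustration, boot.py is assumed to terminate, and the steps not listed above are left out.

```python
def simulate_boot(safeboot=False, boot_calls_main=None):
    """Return the user scripts that would run, in order.

    safeboot        -- True models a safe boot (boot.py and main.py skipped)
    boot_calls_main -- file name passed to machine.main() inside boot.py,
                       or None if boot.py never calls machine.main()
    """
    executed = []
    main_script = "main.py"  # default unless machine.main() changes it

    # step b) if not safeboot: boot.py (assumed to terminate here)
    if not safeboot:
        executed.append("boot.py")
        if boot_calls_main is not None:
            main_script = boot_calls_main

    # step d) if not safeboot: main.py or whatever machine.main() defined
    if not safeboot:
        executed.append(main_script)

    return executed
```

With boot_calls_main="user.py" the result contains user.py but not main.py, which is the "instead of" behaviour; with safeboot=True nothing from the user file system runs.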
@robert-hh Interesting Rob;
- So instead of machine.main('mycode.py') in your main.py you use 'import mycode'? Are these not equivalent?
- You say that boot.py & main.py are independently tested for existence, presumably boot.py first then main.py?
- If we had 'import mycode' in both boot.py and main.py would that double our chances of success? That is, if boot.py was missing or corrupted would it then load our code via main.py instead & only rebuild boot.py as the default (insert your code here) blank?
@kjm If the flash file system cannot be mounted, then it will be recreated, erasing all files.
If the file system exists, then main.py and boot.py will be independently tested for existence and then re-built.
Before boot.py and main.py, the scripts _boot.py and _main.py are executed. These reside in the firmware as frozen bytecode. You can change these only if you create your own firmware.
Beyond that, my personal style is to keep main.py short and mostly only import my bulk code there. So if my code is in e.g. mymain.py, main.py would include a statement 'import mymain'.
That will not prevent file corruption, if that happens e.g. due to unfinished writes.
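As a sketch of the "short main.py that only imports" style, main.py could try a primary module and fall back to a second copy if the first is missing or fails to load. The helper run_first_available and the module names mymain and mymain_backup are assumptions for illustration. It does not prevent corruption; it only gives a second chance when one of the files is damaged.

```python
def run_first_available(module_names):
    """Import the first module in module_names that loads without error.

    Importing a module runs its top-level code, which is how the bulk
    code gets started. Returns the name that was imported, or None.
    """
    for name in module_names:
        try:
            __import__(name)
            return name
        except Exception as exc:
            # ImportError for a missing file, SyntaxError for a damaged one,
            # or any exception raised by the module's own top-level code.
            print("could not run", name, ":", repr(exc))
    return None


# In main.py (hypothetical module names):
# run_first_available(["mymain", "mymain_backup"])
```

Note that an exception raised by the user code itself also triggers the fallback, so the backup copy would then run after a partial run of the first one.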
@jcaron How does mptask.c work? Do both boot.py & main.py have to be corrupted for it to rebuild new blank ones or will it switch to main.py if it can't find boot.py? Are there any alternatives for user code other than boot.py & main.py?
@kjm IIRC there have been quite a few discussions about corruption of the filesystem when mixing writing to files and deep sleep. Something like the write not being fully finished before going to sleep and the filesystem being left in an inconsistent state. I guess this would most probably happen any time a file crosses a cluster boundary. But my understanding is that the whole flash would be wiped, not just the two files.
Don't remember all the details, especially whether that was with "native" deep sleep (machine.deepsleep) or Pysense/Pytrack-controlled deep sleep (or both), nor whether that was fixed in some release in the firmware.
I believe littleFS was actually introduced especially for this reason (and I see at least one post saying switching from FAT to littleFS fixed a similar issue).
Sorry, don't have the time right now to dig in the forum (just found a bunch of posts with issues similar to yours), but this could give you pointers to look for.
Things you can try:
- switch to littleFS (requires an upgrade)
- upgrade to get a newer version which may include some fixes?
- make sure you properly close files before going to deep sleep, and possibly add a small delay before sleep. Not really what you want to hear in terms of power efficiency, though.
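The last suggestion could look roughly like the sketch below. 'with open' closes the file when the block ends, so no explicit close() is needed. The helper name save_then_sleep, the sleep_fn parameter and the 100 ms delay are assumptions for illustration, not measured or recommended values, and os.sync() is only called if the port provides it.

```python
import os
import time


def save_then_sleep(path, text, sleep_fn, settle_ms=100):
    """Write text to path, make sure it is flushed and closed, then sleep.

    sleep_fn  -- callable that puts the device to sleep, e.g.
                 lambda: machine.deepsleep(600000) on the device (assumed usage)
    settle_ms -- arbitrary settling delay; 100 ms is a guess
    """
    with open(path, "w") as f:
        f.write(text)
        f.flush()  # push buffered data to the filesystem driver
    # leaving the 'with' block closes the file, even if write() raised

    if hasattr(os, "sync"):  # not every port provides os.sync()
        os.sync()

    time.sleep(settle_ms / 1000)  # small delay before power-down
    sleep_fn()
```

Passing the sleep call in as a function keeps the helper testable on a PC; on the device it would wrap machine.deepsleep().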
@robert-hh I agree that boot.py & main.py are being recreated after the machine wakes from a deepsleep & can't open one of them. The question is why? I think it must be down to dodgy flash. I've got a hunch it might be temperature related, this only seems to happen when we mount them outside with a solar panel. Inside the enclosure can get up to 50 Celsius here in Australia. Surely flash should remain functional at 50 celsius?
@kjm They all are stored in the same flash chip. The nvs store is in a different memory region. So that is not affected by re-creating the file system. I wonder why just main.py and boot.py are re-created, but not any other file. Because recreating these w/o touching other files would only happen if they cannot be opened during boot. If the filesystem cannot be mounted, it will be recreated, deleting all files in it.
@robert-hh I don't understand why our user code in main.py is subject to corruption when strings we store in a regular python file with f.write or integers we store with pycom.nvs_set are never corrupted. Do you know if all 3 are stored in the same flash memory? If yes any thoughts on why main.py is more susceptible to corruption than the other two?
@kjm With v1.20.0.rtc13 you have the choice between FAT and littlefs. I cannot tell a lot about boot times, only that they seem slow. I have no GPy for testing, only FiPy. And that may be similar. The difference I noticed between 1.18 and 1.20 is that the LTE modem is already started during boot without instantiating the LTE class. That may be the reason for increased boot time.
@robert-hh So when you say 'try development release versions of the firmware' Rob you mean versions after the 1.18.1.r7 we're running now? My problem with later releases is that the code run time between deepsleeps is nearly doubled with later versions. Ours is a battery powered application so that's why we've been sticking with 1.18.1.r7. Also, if I understand you correctly, later versions like v1.20.0.r13 use the littlefs file system which is even more prone to corruption than 1.18.1.r7's FAT?
@kjm Empty boot.py and main.py will be created as part of the MicroPython start procedure, if the file system is corrupted. The code is in mptask.c, line 400ff for FAT, line 481ff for littlefs (with v1.20.0.r13). In the past, I had a few use cases with a corrupted file system:
- when using littlefs
- when using the pre-built pymesh images with a single device.
And so I avoid both variants. I did not go into details about what has caused the corruption. Band 28 was always supported, so the problem other people had with band 8 and the board revision does most likely not apply. The boards had to be replaced.
So maybe you could try the development release versions of the firmware.
Thnx for the reply Rob:
- GPYs still have their release='1.18.1.r7' firmware
- We've never done any OTAs so only that version is on board
- You're right about safe boots they don't delete main.py & boot.py, just stop them from running
So we're left scratching our heads how those 2 files with our programs are being replaced by the blank defaults.
- Can you suggest a possible mechanism?
- Could it be flash corruption? Will the GPY firmware create new blank main.py & boot.py if it encounters corrupted user code in those files?
- If not under what circumstances might the GPY create new blank main.py & boot.py?
@kjm What is the status of the boards? Did they lose their firmware? Safe boot does hardly more than booting without executing boot.py and main.py. If you loaded more than one firmware image by OTA, then it may switch to that other image. If you load the same firmware twice by OTA, then there is no difference.