Strange Micropython Code Execution With Custom (But Unused) C-Code



  • Dear community,

    since a certain stage of commits in our development, we are facing strange errors: at seemingly random times, some of the try/except sections catch something like this:

    First type of error

    Reason: unsupported types for __lt__: 'NoneType', 'int'
    Reason: unsupported types for __truediv__: 'float', 'function'
    Reason: unsupported types for __lt__: 'NoneType', 'int'
    Reason: unsupported type for __hash__: 'dict'
    

    even if the regarding lines of code are executed fine otherwise.

    Second type of error

    or in the data JSON in a SD card, this is logged:

    "sensor value": <function "linear" at 0x3fa0ae50>
    

    before "sensor value" is written to SD card, there is some dataprocessing happening in this style:

    sensorvalue['sensor value'] = linear('sensor name', sensorvalue, Y)
    

    (note that the dictionary is also used as a parameter to the function, may that be a good architecture decision or not ;) )

    So, instead of setting the value to the result of the function, the function object itself is written to the position in the dictionary, and the string serialization transforms that into the <...> representation when the dictionary is json.dumps-ed to the SD card. But that just seems another form of the errors of the first kind.

    The errors at first seemed only to appear if there is additional C-code present (beginning with a certain commit). It does not matter if the C-code is acutally used; the errors also appear if none of the C-functions are accessed. But they seem to occur more often, if the statically defined data uint8_t - buffers are larger or the C-code is actually used (creates two additional freeRTOS tasks with around 6k of stack each).

    So one hypothesis is: the C-code generates additional flash and RAM usage for the statically allocated buffers and other variables. That may not leave enough stack for the freeRTOS Python-task on the ESP32 (with external PSRAM) to run uninterruptedly.
    Today, I also saw errors like that with a compiled version where the C code is physically present on the disk, but not used in the make files, and I also can confirm that there are no *.o files of the C code in the build path. I will check if the behaviour changes if the C source files are no longer in the project path. So, maybe this hypothesis is no longer valid, but then I don't know what else could produce such a behaviour.

    This is what the compiler puts out as statistics after compilation:

    xtensa-esp32-elf-size build/WIPY/release/application.elf
       text    data     bss     dec     hex filename
    1016180  628056   47060 1691296  19cea0 build/WIPY/release/application.elf
    

    we are using Pycom 1.20.2.r4 and 1.20.2.r6, error happens in both versions. We already stripped off Bluetooth an CAN modules in order to have enough space for the rather large Python code. This can also be reproduced on all WiPy devices we tested the firmware on (more than 10).

    Is the hypothesis correct? Or what other thing could be the cause of such errors?
    Did anyone here experience such a behaviour?
    If that is linked to the size of the C-code and variables, can we, already at compile time, check violation of limits before running that code on the device?
    What other things can we test to locate the cause of the error, or, what solution may already be out there?

    Thanks in advance for help!

    PS: posted that also here:
    https://forum.micropython.org/viewtopic.php?f=18&t=11653



  • Most interestingly, it is always like that: the device runs for some time (in the realm of several hours), and after that, the described errors repeatedly appear in shorter cycles than the time span til the first appearance of the error.

    I think we were able to track the origin of the error down, at least in this specific environment (ESP32, Pycom 1.20.2.r6, custom Python code, also combination of custom Python and C code):

    All binaries compilled with xtensa 1.22.0-80 show this strange behaviour.
    All binaries compiled with xtensa 1.22.0-98 DO NOT show this strange behaviour (same device, same VM, also on other devices with different VM, underlying source code version with which tests were performed also differed a lot)

    So for now, this seems to be solved by simply use a more recent compiler. It would be very interesting why this is (seemingly) compiler-dependent?

    @pycom it could be worth to update the download link to the xtensa compiler. I will also open a Github issue, because since 23 days no Pycom member seemed to like to help here in the forum.

    It still leaves some bitter taste though, because how can we be sure that it never accours again with the newer compiler, only at some later point in time? Test duration was only around 5 - 6 days.



  • Update:
    There was a misunderstinge between the other person and me. So the binary that did produce the strange execution behaviour was actually compiled on a (third) different machine, which had the xtensa compiler version 1.22.0-80 installed. The versions that worked fine were built with 1.22.0-98, and on my machine (which produced binaries that show strange behaviour) again with 1.22.0-80.
    Note: current readme in Pycoms repository points to 1.22.0-80 for xtensa

    Current summary:

    VM1: always shows errors
    GCC: 10.3.0 and 11.2.0
    xtensa: 1.22.0-80

    VM2: never showed errors
    GCC: 11.2.0
    xtensa: 1.22.0-98

    VM3: always shows errors
    GCC: 11.2.0
    xtensa: 1.22.0-80

    Tests still pending to see if that is provable by using xtensa 1.22.0-98 on VM1



  • OK now I am lost and hopefully someone can put me in the right direction or has a hint about this.

    I compiled the untouched pycom 1.20.2.r6 with custom Python code under esp32/frozen/Custom. No submodules involved there.

    • The version I compiled (Ubuntu Mate VM 21.10, GCC 11.2.0) shows the strange errors above on a device, that with a different version before did not produce these errors. So a device failure is ruled out.
    • The version another person compiled on their machine (Ubuntu Mate VM 21.10, GCC 11.2.0) does NOT schow the strange errors. BUT: using a mpy-cross compiled on my machine, the other person was also able to compile a binary with which the devices produced such errors. We now about to try to compare compiled binaries between the two VMs.

    If the frozen bytecode is sometimes executed wrong, or memory is confused somehow, why could it seemingly be dependent on the machine the code is compiled on?
    Why does something like this happen at all?



  • The binaries created with mpy-cross are the same (checked with vbindiff), no matter what gcc version was used to compile mpy-cross. So that cannot be the cause.



  • Some new insights:

    • Those erros can also occur after a freshly cloned pycom 1.20.2.r6 version
      • without Bluetooth and CAN
      • without any custom C code
      • with frozen Python code
    • It does not seem to matter if Python code to be frozen is located in "Custom" or a different folder (added in mkrules.mk)

    I am investigating further, but it seems that in other versions that worked the Python code was frozes by an mpy-cross which was compiled with GCC 11.2.0. The ones that seem to cause trouble were compiled with GCC 10.3.0. The binaries of both mpy-cross versions are different in some places.


Log in to reply
 

Pycom on Twitter