pbuf_free crash cause?



  • Hi all,

    My code has been crashing intermittently for a while now, showing up as guru_meditation errors. However, since the update to 1.20.2.r0, it seems to manifest a bit differently and with more information.

    Now, before the core dump, I get these messages:

    assertion "pbuf_free: p->ref > 0" failed: file "/Users/ehlers/pycom/pycom-esp-idf/components/lwip/lwip/src/core/pbuf.c", line 765, function: pbuf_free
    abort() was called at PC 0x40167e57 on core 1
    
    ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000
    
    Backtrace: 0x4008eec3:0x3fff6e70 0x4008f045:0x3fff6e90 0x40167e57:0x3fff6eb0 0x401745a7:0x3fff6ee0 0x40171541:0x3fff6f00 0x40172082:0x3fff6f80 0x401720de:0x3fff6fb0 0x400ed4ee:0x3fff6fd0 0x400e5f8d:0x3fff6ff0 0x401013ae:0x3fff7030 0x400fd7b9:0x3fff7050 0x400fd849:0x3fff7070 0x401097f7:0x3fff7090 0x401014a4:0x3fff7130 0x400fd7b9:0x3fff7160 0x400fd849:0x3fff7180 0x401097f7:0x3fff71a0 0x401014a4:0x3fff7240 0x400fd7b9:0x3fff72c0 0x400ffd1c:0x3fff72e0 0x400ffd39:0x3fff7330 0x400fd7b9:0x3fff7350 0x401087d8:0x3fff7370 0x400e0749:0x3fff7400
    
    ================= CORE DUMP START =================
    [rest of the core dump omitted; I can send it if helpful]
    

    I'm not sure, but I assume this crash has something to do with free memory? Can someone confirm, or give me a clue as to the probable cause of this type of crash, so that I can work towards fixing it?

    Thank you,
    Troy



  • @troy-salt The error appears in the packet buffer (pbuf) management of lwip, the lightweight TCP/IP implementation used in ESP-IDF (the SDK for the ESP32). So you probably won't see a direct link between your Python code and the error, though understanding what your code does (such as using TCP/IP connections from multiple threads) may help in finding the root cause.

    Is the error easily reproducible? Is there any way for you to build a minimal test case which exhibits the behaviour?

    Apparently lwip can work in a multi-threaded environment, but there are restrictions on what functions can be called from which threads. The pbuf code is supposed to be thread-safe, though.
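
    If you want to test whether concurrent socket use from several Python threads is involved, one experiment is to funnel every network call through a single lock. This is a minimal sketch under stated assumptions, not a confirmed fix: `net_lock`, `locked_call` and `max_concurrency` are hypothetical names, it uses CPython's `threading` so it runs anywhere, and on Pycom/MicroPython the lock would presumably come from `_thread.allocate_lock()` instead.

```python
import threading

# One lock shared by every thread that uses sockets.
# Assumption: on Pycom/MicroPython, _thread.allocate_lock() would play the same role.
net_lock = threading.Lock()


def locked_call(func, *args, **kwargs):
    """Run one network operation (e.g. sock.send) while holding net_lock."""
    net_lock.acquire()
    try:
        return func(*args, **kwargs)
    finally:
        net_lock.release()


def max_concurrency(n_threads=4, calls_per_thread=100):
    """Hypothetical stand-in for socket calls: report the largest number of
    threads that were ever inside the 'network' section at the same time."""
    inside = []
    peak = [0]

    def fake_send(tid):
        inside.append(tid)
        peak[0] = max(peak[0], len(inside))
        inside.remove(tid)

    def worker(tid):
        for _ in range(calls_per_thread):
            locked_call(fake_send, tid)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak[0]
```

    With the lock in place, `max_concurrency()` returns 1. In real code you would write something like `locked_call(sock.send, data)`. If the crash disappears with all network calls serialized, that points towards a threading issue rather than plain memory exhaustion.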

    The error is triggered here: https://github.com/pycom/pycom-esp-idf/blob/master/components/lwip/core/pbuf.c#L694 (though the path and line number don't quite match your log, so there may be a version mismatch). It concerns the reference counts of buffers. Apparently the decrement itself is protected, but not the actual free.
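
    To illustrate what the failed check means, here is a simplified Python model of a reference-counted buffer. It is not lwip's actual code; `RefCountedBuf`, `add_ref` and `free` are hypothetical names. Each owner of the buffer calls `free()` once, and the buffer is released when the count reaches zero. Freeing it once more than it was referenced hits the same condition as the `pbuf_free: p->ref > 0` assertion in the log.

```python
class RefCountedBuf:
    """Toy model of a reference-counted packet buffer (not lwip's real pbuf)."""

    def __init__(self, payload):
        self.payload = payload
        self.ref = 1          # a new buffer starts with one owner
        self.released = False

    def add_ref(self):
        # A second owner (another layer or thread) keeps the buffer alive.
        if self.released:
            raise RuntimeError("use after free")
        self.ref += 1

    def free(self):
        # Same condition as the failed assertion in the log: "pbuf_free: p->ref > 0".
        if self.ref <= 0:
            raise AssertionError("pbuf_free: p->ref > 0")
        self.ref -= 1
        if self.ref == 0:
            self.released = True  # real code would return the memory to its pool here
        return self.released
```

    For example, after `buf = RefCountedBuf(b"data")` and one `buf.add_ref()`, the first `buf.free()` returns `False` (still referenced), the second returns `True` (released), and a third raises `AssertionError`. That is the toy equivalent of a double free. In the real firmware, the released memory may meanwhile have been reused, which is also why corruption can show up far from its cause.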

    If someone can symbolicate the dump (map its addresses back to function names), it should tell us a bit more, especially if we are lucky enough that the crash happens on a call made from the wrong thread.
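
    Symbolicating means turning each program counter in the `Backtrace:` line into a function name and source line. ESP-IDF documents doing this with `xtensa-esp32-elf-addr2line -pfiaC -e <elf> <addresses>`. The sketch below only extracts the PCs and builds that command. It assumes you have the ELF file of the exact firmware build that crashed; `application.elf` is a placeholder path, and the helper names are hypothetical.

```python
import re

# Each backtrace entry is "PC:SP"; only the PC (first half) is useful to addr2line.
PC_SP_RE = re.compile(r"0x([0-9a-fA-F]{8}):0x[0-9a-fA-F]{8}")


def backtrace_pcs(line):
    """Return the program counters from an ESP32 'Backtrace:' line, innermost frame first."""
    return ["0x" + pc.lower() for pc in PC_SP_RE.findall(line)]


def addr2line_cmd(elf_path, pcs):
    """Build the addr2line invocation; elf_path must match the crashed firmware build."""
    return ["xtensa-esp32-elf-addr2line", "-pfiaC", "-e", elf_path] + pcs


if __name__ == "__main__":
    line = ("Backtrace: 0x4008eec3:0x3fff6e70 0x4008f045:0x3fff6e90 "
            "0x40167e57:0x3fff6eb0 0x401745a7:0x3fff6ee0")
    print(" ".join(addr2line_cmd("application.elf", backtrace_pcs(line))))
```

    Run the printed command on a machine with the ESP32 toolchain installed. The `ELF file SHA256` line in the log is all zeros, so the log itself does not appear to identify which ELF matches. Using the ELF from the same 1.20.2.r0 build therefore matters; a mismatched ELF gives plausible-looking but wrong function names.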

    Edit

    Actually, I don't remember whether Python threads map to CPU threads. I think there was a discussion on this topic a few weeks back...



  • Hi @robert-hh and @jcaron

    I'm using a WiPy on version 1.20.2.R0.

    The source code I'm running is quite large, with about 30 files and several thousand lines of code, so it might be a lot to dig through. Or are you saying that you can use the core dump to narrow down which line or file might have caused the crash? I could send you the source code privately if that helps.

    Also of note: when the device is running nominally, it generally sits at about 500 KB / 4 MB of flash and 1.7 MB / 4 MB of memory.
    Does this help point towards either situation? I use several threads in my program too.

    I uploaded the full crash report including the core dump here:
    http://s000.tinyupload.com/?file_id=51000101731936818409

    I appreciate you taking a look, but I also understand if it's too involved a process to spend your free time on.

    Thanks,
    Troy



  • @troy-salt given the error message, my guess leans towards either a double free or corruption of some memory (possibly a buffer overflow).

    If you provide the details of your board, it is possible to decode the stack trace and core dump to get the sequence of calls and try to find the error (though it’s much easier if it’s a double free rather than memory corruption, as the latter may happen in a completely different place in the firmware).

    Providing the Python source code you use would also help, along with any other traces and logs it generates which could tell us what it is doing at that time.
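
    For those traces and logs, one low-effort option is a helper that stamps each log line with the time, the calling thread and the free heap. Then the last lines before a crash show which thread was active and whether memory was running low. This is a minimal sketch; `trace` is a hypothetical name. `gc.mem_free()` exists on MicroPython but not on CPython, so the code falls back to `None` there.

```python
import gc
import time
import _thread


def trace(event, log=print):
    """Emit one line: timestamp, thread id, free heap in bytes (if known), event text."""
    mem_free = getattr(gc, "mem_free", None)  # MicroPython only; missing on CPython
    free = mem_free() if mem_free is not None else None
    record = "{} thread={} free={} {}".format(time.time(), _thread.get_ident(), free, event)
    log(record)
    return record
```

    Calling `trace("before send")` and `trace("after send")` around socket operations in each thread keeps the overhead small while still leaving a useful trail on the serial console.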



  • @troy-salt You mentioned the firmware version, but it is also important which board you use.

