Lopy4 FFT (spectrum analysis) micropython, adding C code?



  • Hi,

    I would like to add FFT (2048 points) as a MicroPython call, written in C for performance.
    (I won't use high sample rates though, looking in the 0.02 - 20 Hz range of the sensor)

    The code is already written for the esp32:
    http://www.robinscheibler.org/2017/12/12/esp32-fft.html
    https://github.com/fakufaku/esp32-fft

    But how can this be added? Will the developers add it on request, or should I do it myself?
    And if I have to do it, what is the best practice for adding C functions to MicroPython on the LoPy4?

    Edit: we would need this library (optional?) on the Pycom devices:
    https://github.com/espressif/esp-dsp/

    Thanks



  • @crumble said in Lopy4 FFT (spectrum analysis) micropython, adding C code?:

    @KenM

    ESP32 has some nasty memory problems when using both cores. This is worked around in FreeRTOS by software; Pycom updated it twice between rc0 and rc8.
    Float has a high memory load, which may explain the speed difference, because it has to run through these fixes on each operation. Better slow than faulty. Even better would be an implementation with a lower memory load.

    If the basic DSP library that exists (https://github.com/espressif/esp-dsp/) could be added, it would take less memory, be much faster, and allow more applications, even audio analysis. (The library contains int and float versions, in C and assembler, for the ESP32.)
    But I don't know how to make it available in MicroPython (that general link wasn't very helpful).

    If Pycom could make a guide on how to add a new MicroPython function that takes an array and returns an array, that would be very helpful!

    EDIT from Pycom:
    Creating a MicroPython module is pretty straightforward. In the repository:

    • Go to the mods folder in pycom-micropython-sigfox/esp32/mods and create 2 files: modfft.c and modfft.h
    • Add the modfft.c file to application.mk. Add it to APP_MODS_SRC_C
    • Create the micropython fft type like mp_module_uqueue and add it to esp32/mpconfigport.h
    • As examples you can look at moduqueue.c, moduhashlib.c, etc.


  • @KenM

    ESP32 has some nasty memory problems when using both cores. This is worked around in FreeRTOS by software; Pycom updated it twice between rc0 and rc8.
    Float has a high memory load, which may explain the speed difference, because it has to run through these fixes on each operation. Better slow than faulty. Even better would be an implementation with a lower memory load.



  • Something obviously didn't add up. I didn't want to wait until next week to check it, so I used my LoPy4 at home:
    Turns out that the two speeds below were not running the same algorithm (FT vs FFT).
    That also explains why switching between frozen and not frozen didn't appear to work; it worked, it just didn't show a speed difference.
    So the speed for 512 points was 148 ms (using the right FFT).
    With the frozen folder I got 149.5 ms; yes, it was even a little slower.
    So not really a good idea to use the frozen folder in this case.
    This is the fast one that I used:
    https://www.nayuki.io/page/free-small-fft-in-multiple-languages

    The less-than-1-second code is good enough for my current "analyze -> give a warning" application, but if someone wants to put time into making the C code version in some DSP module, that would be very nice.

    But I did notice something else surprising during my verification tests.
    I did these tests on another LoPy4 at home that still had version r1, and there is a difference in speed: 1.18.2.r1 is faster than 1.18.2.r4.
    The FFT on 1.18.2.r1: 145.4 ms (a little faster than 148 ms)
    The FT (slow one) on 1.18.2.r1: 13.65 sec (much faster than 17.09 sec, -20%)

    The slow code that shows the big difference between .r1 and .r4, if you want to check:

    import cmath

    def dft(input, inverse=False):  # No simplification; O(n^2) reference transform
    	n = len(input)
    	output = []
    	if n == 0:
    		return output
    	coef = (2j if inverse else -2j) * cmath.pi / n
    	for k in range(n):  # For each output element
    		s = 0
    		for t in range(n):  # For each input element
    			s += input[t] * cmath.exp((t * k % n) * coef)
    		output.append(s)
    	return output
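
    The nayuki code linked above gets its speed from an iterative radix-2 Cooley-Tukey transform instead of the O(n²) double loop. A minimal pure-Python sketch of that idea (the function name and structure are mine, not nayuki's exact code):

```python
import cmath

def fft_radix2(values):
    """Iterative radix-2 FFT; len(values) must be a power of 2."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    out = [complex(v) for v in values]
    # Bit-reversal permutation of the input order
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            out[i], out[j] = out[j], out[i]
    # Butterfly passes: O(n log n) instead of the O(n^2) dft() above
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(start, start + half):
                t = out[k + half] * w
                out[k + half] = out[k] - t
                out[k] += t
                w *= step
        size *= 2
    return out
```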
    

    So, in summary: the frozen folder gives no speed benefit here, and 1.18.2.r1 runs noticeably faster than 1.18.2.r4.



  • @robert-hh said in Lopy4 FFT (spectrum analysis) micropython, adding C code?:

    @KenM That's far beyond what I had seen before and very interesting. I'd like to understand what causes the difference. It must have to do with the bytecode/data accesses hitting two different memory devices and areas.

    For now, take it with a grain of salt: I was also experimenting with faster Python code, and this was just before I had to leave, so I will verify it next week.
    I couldn't switch between frozen and not frozen, so something else was going on in the code (more later).



  • @KenM I just made a quick comparison with simple code that performs some integer and float add/multiply in a loop. No significant difference between frozen and non-frozen code. I'm pretty confident that you verified the results from the frozen code. Note that the final time is in range compared to the findings of Peter Hinch; just the speed improvement is strange, or rather, the previous times were.
    Edit: I took this python fft implementation from Peter's web site:
    https://github.com/peterhinch/micropython-fourier/blob/master/algorithms.py
    Set the size to 2048 and tried it both in the file system and frozen byte code. The speed for the fft() call is ~640ms each.
    Edit 2: Using micropython.org's variant on an ESP32 board I get about the same times, frequency-corrected a little bit worse. On a Sipeed MaixPy device I get a time of 130 ms; frequency-corrected, that is about twice the speed, however using 64-bit floats.
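
    For anyone reproducing these measurements, a simple best-of-N timing harness is enough; this sketch uses `time.monotonic()` (on MicroPython you would typically use `time.ticks_ms()`/`time.ticks_diff()` instead), and the timed workload is just an illustrative stand-in:

```python
import time

def time_call(fn, *args, repeat=3):
    """Return the best wall-clock time in ms over several runs."""
    best = None
    for _ in range(repeat):
        t0 = time.monotonic()
        fn(*args)
        dt = (time.monotonic() - t0) * 1000.0
        if best is None or dt < best:
            best = dt
    return best

# Stand-in workload, just to exercise the harness
ms = time_call(lambda: sum(i * i for i in range(10000)))
```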



  • @KenM That's far beyond what I had seen before and very interesting. I'd like to understand what causes the difference. It must have to do with the bytecode/data accesses hitting two different memory devices and areas.



  • I tried the frozen folder and it sped things up a lot, x119. Edit: (no, it didn't; see above)

    I could do an FFT of 2048 points in 0.76 sec (still much slower than pure C, but fast enough for me for now)

    PYB: soft reboot
    pre-proscessing: 48.35505 ms
    FFT: 17.66234 sec
    post-proscessing: 28.35017 ms
    [(0.05078125, 1.189949), (0.1484375, 0.3297071), (0.046875, 0.3009737), (0.25, 0.2653955), (0.1523438, 0.210931), (0.0546875, 0.196845), (0.4492188, 0.1462081), (0.3515625, 0.1452211), (0.04296875, 0.1359008), (0.5507813, 0.1245635)]
    Pycom MicroPython 1.18.2.r4 [ba178ae] on 2019-04-12; LoPy4 with ESP32
    Type "help()" for more information.

    PYB: soft reboot
    pre-proscessing: 48.39582 ms
    FFT: 0.1495537 sec
    post-proscessing: 12.8089 ms
    [(0.05078125, 1.189949), (0.1484375, 0.329707), (0.046875, 0.3009736), (0.25, 0.2653956), (0.1523438, 0.210931), (0.0546875, 0.1968451), (0.4492188, 0.146208), (0.3515625, 0.1452211), (0.04296875, 0.135901), (0.5507813, 0.1245635)]
    Pycom MicroPython 1.18.2.r4 [ba178ae] on 2019-04-12; LoPy4 with ESP32
    Type "help()" for more information.

    I had some problems trying the slow and fast versions at the same time, probably due to some cache that remembered what it did last time?!



  • @KenM

    I don't expect much of a speed difference with integers instead of floats because the time I'm losing is not due to slow FP calculation but the huge micropython overhead for each calculation.

    Have a look into the float implementation and you will understand why integers are much faster. Floats are not slow because of FPU speed or bytecode interpretation; they waste a lot of your heap and force a lot of GC calls. Integers behave like your C ints; floats do not.

    But C will be faster anyway, and you don't have to rewrite the lib ;)



  • @KenM said in Lopy4 FFT (spectrum analysis) micropython, adding C code?:

    I don't expect much of a speed difference with integers instead of floats because the time I'm losing is not due to slow FP calculation but the huge micropython overhead for each calculation.

    There is a difference in how MicroPython treats float and integer. It will always allocate a memory chunk for a float number, whereas small integers, < 2**30, are kept in the bytecode objects themselves and do not allocate memory. So for the pure arithmetic you can expect a speed increase by a factor of 5.
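
    The effect shows up in a micro-benchmark like the sketch below (the loop bodies are mine, purely illustrative): the integer loop stays well below 2**30, so on MicroPython it never allocates, while every float addition creates a fresh heap object. On desktop CPython the gap is much smaller, since CPython boxes ints too, and on MicroPython you would time with `time.ticks_ms()` instead:

```python
import time

def sum_ints(n):
    s = 0
    for _ in range(n):
        s += 1            # small int: no heap allocation on MicroPython
    return s

def sum_floats(n):
    s = 0.0
    for _ in range(n):
        s += 1.0          # allocates a new float object every iteration
    return s

N = 100000
t0 = time.monotonic(); si = sum_ints(N); ti = time.monotonic() - t0
t0 = time.monotonic(); sf = sum_floats(N); tf = time.monotonic() - t0
print("int: %.1f ms  float: %.1f ms" % (ti * 1000, tf * 1000))
```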

    And as said, putting code into the frozen folder should not affect the runtime speed. I do it a lot and have never noticed faster code execution. It lets the code start faster and does not use memory during parsing, so it is most helpful on devices with small RAM.



  • @crumble said in Lopy4 FFT (spectrum analysis) micropython, adding C code?:

    you can place precompiled code either in the frozen

    You have to put .py code into the frozen folder; it will be cross-compiled as part of the build process. As far as I can tell, pre-compiled code from the normal file system only runs faster on devices without SPI RAM, like the LoPy1 or WiPy2, but those are very limited in heap space.



  • I will first try using the frozen folder; I'll let you know how much it speeds things up.
    512 points takes 17 seconds, so that's the reference.
    I don't expect enough of a speedup, so I may need to add the C functions, which should be much faster.

    I don't expect much of a speed difference with integers instead of floats because the time I'm losing is not due to slow FP calculation but the huge micropython overhead for each calculation.

    I have an Ubuntu VM prepared now to do the compiling job. (fingers crossed)

    Edit: using Ubuntu on Windows 10 instead, I created a new lopy4.bin (no extra C yet). Now I'm looking for the best way to load the firmware with the tool, which needs a .tar file including bootloader.bin. I hope I can just leave those out, or reuse the ones that Pycom uses. (Access to the COM port from Ubuntu would also avoid needing the tool.)
    Yes, having a Linux-bootable PC would make this much easier.



  • You need some kind of Unix for the best UX. Linux on Windows works as well.

    The firmware repo on GitHub has a good how-to. There are only two pitfalls: don't forget to update the git submodules after each branch switch, and something in the toolchain still uses Python 2.

    As far as I understand, you can place precompiled code either in the frozen folder or in the normal folder structure. From the normal one it is copied into RAM before runtime; from the frozen folder it runs directly from flash. RAM should, IMHO, be faster. I precompile only for less fragmentation, so I never benchmarked the difference. Same for the -O option of the cross compiler.

    Your code uses a lot of float objects, so it will be bound by RAM I/O when you use your MicroPython implementation. If you run it in a background thread, you can call micropython.mem_info() to see how memory fills up with float objects and how the GC is called as soon as every block is occupied. This will be more than 90% of your runtime. Your code will be much faster if you can implement it with ints only.

    You will find a how-to for adding C functions on the MicroPython page.



  • @KenM Frozen byte code does not execute faster, it just loads faster.



  • @crumble said in Lopy4 FFT (spectrum analysis) micropython, adding C code?:

    @KenM
    IMHO you clone the firmware, add your stuff and hope that automatic merge will work most of the time.

    Pycom told me to try the 'frozen' folder to increase speed.
    This would keep it in Python and avoid the complexity of adding the interface to MicroPython.

    But I don't know which tools are best to use, and how to use them, to create a proper lopy4.bin file.
    Edit: instructions are here: https://github.com/pycom/pycom-micropython-sigfox/blob/master/README.md



  • @robert-hh said in Lopy4 FFT (spectrum analysis) micropython, adding C code?:

    @KenM Have a look at the implementation of Peter Hinch: https://github.com/peterhinch/micropython-fourier
    You may also contact him directly about his code structure to improve the speed. Taking the 12ms, his code needs for a 1024 point fft written in assembler, a time in the second range should be achievable with python code. Avoiding allocation, like @crumble said, is surely one of the methods.

    I saw that link, but there are two problems with it:

    1. It's ARM assembler (in theory this could be reworked). Edit: https://github.com/espressif/esp-dsp/blob/master/modules/fft/float/dsps_fft2r_fc32_ae32.S
    2. I don't think Pycom supports executing assembler via MicroPython.


  • @KenM

    The post-processing after the FFT only took 0.02 seconds in Python, so I don't think that needs to be put in C; the pure FFT functions are the problem.

    It is not only about the time. The float objects can fragment the memory; depending on your code, the device may not be able to allocate such a huge array. As a LoPy1 user I would move most of the float stuff into C if I had to do this anyway. But you 4 MB guys may have more luck ;)

    Putting only the FFT stuff in C and keeping the rest in Python may be better for development, so you don't need to upload the firmware that often.

    How to keep that working with firmware updates?

    IMHO you clone the firmware, add your stuff and hope that automatic merge will work most of the time.



  • @KenM Have a look at the implementation of Peter Hinch: https://github.com/peterhinch/micropython-fourier
    You may also contact him directly about his code structure to improve the speed. Taking the 12ms, his code needs for a 1024 point fft written in assembler, a time in the second range should be achievable with python code. Avoiding allocation, like @crumble said, is surely one of the methods.
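
    One allocation-avoiding pattern from Peter Hinch's code is to preallocate the working buffers once, with `array.array`, and reuse them on every transform. A rough sketch of that pattern (the class and names are mine, not from his repo):

```python
from array import array
import math

class Analyzer:
    def __init__(self, n):
        # Allocate all buffers once; no per-call heap churn afterwards
        self.n = n
        self.re = array('f', (0.0 for _ in range(n)))
        self.im = array('f', (0.0 for _ in range(n)))
        # Hann window coefficients, also computed once
        self.window = array('f', (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
                                  for i in range(n)))

    def load(self, samples):
        # Copy samples into the preallocated buffers, applying the window
        for i in range(self.n):
            self.re[i] = samples[i] * self.window[i]
            self.im[i] = 0.0
```

    The FFT itself would then operate in place on `self.re`/`self.im`, so repeated transforms trigger no garbage collection.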



  • I tried a MicroPython FFT algorithm with 2048 points and it took 4m53s to complete (yes, almost 5 minutes), hence the need for the C version.

    Should this be incorporated in the lopy4.bin or is this a .bin beside it?
    How to keep that working with firmware updates?
    Does pycom provide info for this?

    The post-processing after the FFT only took 0.02 seconds in Python, so I don't think that needs to be put in C; the pure FFT functions are the problem.

    Thanks



  • @KenM

    Keep in mind that the float implementation works similarly to the string implementation: each expression will create at least one new object. Even

    x += 1.0
    

    will generate a new object. So you have to deal with heavily fragmented memory. If you do most of your float stuff around the FFT code, move that to C as well; it will make memory allocation easier.
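
    This behaviour is easy to demonstrate; the snippet below runs on desktop Python too, though only on MicroPython does each such allocation translate into heap fragmentation and GC pressure:

```python
x = 1.0
old = x          # keep a reference to the original float object
x += 1.0         # for floats, += allocates a new object and rebinds x
assert x is not old          # x now names a different object
assert old == 1.0 and x == 2.0
```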

