lora.join does not attempt every 15 seconds in background

Graham

If lora.join fails on the first attempt, it does not attempt again every 15 seconds as suggested in Pycom docs.

Example code:

lora.join(activation=LoRa.OTAA, auth=(app_eui, app_key), timeout=0)
# wait for a connection to be established
while not lora.has_joined():
    pass

This hangs forever if the first join fails (which happens to us often enough during educational workshops with LoPy4).

The docs say "Internally the stack will automatically retry every 15 seconds until a Join Accept message is received", which implies no wizardry is needed to get a join retry every 15 s.

If I am overlooking something, any tips are welcome!

robert-hh

@graham said in lora.join does not attempt every 15 seconds in background:

while not lora.has_joined():
    pass

I would add something like sleep(1) into the loop, giving the background tasks more time.
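A minimal sketch of that suggestion, written as a helper so the loop can also be tried off-device. `wait_for_join`, `poll_s` and `max_wait_s` are hypothetical names, not part of the Pycom API; the only call assumed on the `lora` object is `has_joined()`, as in the loop quoted above.

```python
import time

def wait_for_join(lora, poll_s=1, max_wait_s=None):
    """Poll lora.has_joined(), sleeping between checks instead of busy-waiting.

    Returns True once joined, or False if max_wait_s seconds of sleeping
    have passed without a join (None means wait forever).
    """
    waited = 0
    while not lora.has_joined():
        if max_wait_s is not None and waited >= max_wait_s:
            return False
        time.sleep(poll_s)  # yield instead of spinning on has_joined()
        waited += poll_s
    return True
```

On the device this would replace the bare loop, e.g. `wait_for_join(lora)` right after `lora.join(...)`. Whether the sleep actually helps the stack retry is untested here.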

Graham

@emmanuel-florent here's the code. I tested this many times with four LoPy4s before posting, pressing the reset button on all of them at the same time. Almost always, one or two devices will not connect and do not recover.

Gateway logs show one join request from the failed device(s). I watched the logs for two minutes each time, awaiting any signs of life (there were none).

from network import LoRa
import pycom
import binascii
import socket

lora = LoRa(mode=LoRa.LORAWAN, region=LoRa.AS923)
app_eui = binascii.unhexlify('XXXXXXXXXXXXXX')
app_key = binascii.unhexlify('XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

lora.join(activation=LoRa.OTAA, auth=(app_eui, app_key), timeout=0)

print('Waiting for LoRaWAN network connection...')
while not lora.has_joined():
    pass

print('Network joined!')

while True:
    pass

Emmanuel Florent

@graham please let me look at the code for the workshop. You can write to me at emmanuel at pycom dot io.

robert-hh

@graham I have never seen that behavior. Did you check that the various boards you are using have different dev-euis, if these were set manually?

About Ctrl-C: The LoRa stack/device is not fully reset on Ctrl-C. You have to do a hard reset (button) or call machine.reset().

Graham

@harald The gateway logs (more specifically, the TTN console data logs) show a single join event. No additional join events occur in the TTN console after this issue is invoked (which suggests the LoPy4 is the issue).

@emmanuel-florent I've omitted the region settings to show just the minimal code. We're heavy users of LoPy hardware; everything upstream of the join event is fine.

@robert-hh I agree there is a collision, though it's not attempting every 15 seconds. A failed connection never recovers, even if other failed devices are turned off (leaving 1 failed device "on" to attempt its own recovery).

It appears more and more likely that there's an issue in the stack for this scenario. With REPL connected, sending a CTRL+C during a failure condition

Left as is, this bug could be viewed as a DDoS layer for Pycom LoPy4. Devices that attempt to connect on power-up could find themselves in this situation and never "connect" using any of the recommended examples in Pycom Docs. I doubt many people have experienced the issue as they are not holding educational workshops that invoke more-than-normal LoRaWAN events over a short time.

Emmanuel Florent

The firmware does repeat the join request.
The uplink join frequency (in MHz) depends on the regional specification, and so does the retry interval (in seconds).
Depending on your setup (the region in particular) you need a specific channel configuration, so which regional specification are you targeting?
For example, for US_915 sub-band 1 you would OTAA join on channel 64, and that would be:

    lora.add_channel(int(upstream.get('chan')), frequency=int(upstream.get('fq')), dr_min=0, dr_max=int(data_rate))
    # remove every other channel; compare as int, since 'chan' is converted with int() above
    for index in range(0, 71):
        if index != int(upstream.get('chan')):
            lora.remove_channel(index)


        # create an OTA authentication params
        dev_eui = binascii.unhexlify(self.config["lora"]["otaa"]['dev_eui'])
        app_key = binascii.unhexlify(self.config["lora"]["otaa"]['app_key'])
        nwk_key = binascii.unhexlify(self.config["lora"]["otaa"]['nwk_key'])

        start_join_time = time.time()

        # lora.nvram_restore()

        if not lora.has_joined():
            if self.config["lora"]["region"] == "US915":
                lora.join(activation=LoRa.OTAA, auth=(dev_eui, app_key, nwk_key), timeout=0, dr=0)
            else:
                lora.join(activation=LoRa.OTAA, auth=(dev_eui, app_key, nwk_key), timeout=0, dr=int(self.config["lora"]["data_rate"]))


        # wait until the module has joined the network
        while not lora.has_joined():
            time.sleep(1)
            pycom.rgbled(0x000000)
            print('.', end='')
            time.sleep(1)
            pycom.rgbled(0x0000ff)

        lora.nvram_save()



    for i in range(0, 8):
        fq = 902300000 + (i * 200000)
        self.lora.add_channel(i, frequency=fq, dr_min=0, dr_max=self.config["lora"]["data_rate"])
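For reference, the channel arithmetic in the snippets above can be sketched on its own. This assumes the LoRaWAN US902-928 uplink plan (64 channels of 125 kHz starting at 902.3 MHz in 200 kHz steps, then 8 channels of 500 kHz starting at 903.0 MHz in 1.6 MHz steps) and the common convention that sub-band n holds eight 125 kHz channels plus one 500 kHz channel. The helper names are hypothetical and nothing here calls the Pycom API.

```python
def us915_uplink_hz(channel):
    """Uplink centre frequency in Hz for a US915 channel index (0-71)."""
    if 0 <= channel <= 63:
        # 125 kHz channels: 902.3 MHz upwards in 200 kHz steps
        return 902300000 + channel * 200000
    if 64 <= channel <= 71:
        # 500 kHz channels: 903.0 MHz upwards in 1.6 MHz steps
        return 903000000 + (channel - 64) * 1600000
    raise ValueError("US915 uplink channels are numbered 0-71")

def us915_subband_channels(subband):
    """Channel indices for sub-band 1-8 (eight 125 kHz + one 500 kHz)."""
    if not 1 <= subband <= 8:
        raise ValueError("sub-band must be 1-8")
    first = (subband - 1) * 8
    return list(range(first, first + 8)) + [64 + subband - 1]

# List the channels and frequencies of sub-band 1
for ch in us915_subband_channels(1):
    print(ch, us915_uplink_hz(ch))
```

Under that convention sub-band 1 is channels 0-7 plus channel 64, which matches the join on channel 64 and the `902300000 + (i * 200000)` loop above.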

Harald

What do you experience if you power up your gateway after the first join attempt? I seem to have the same issue: only one join attempt.

robert-hh

@graham It looks to me as if two devices try to join at the same time over and over again, and the join messages then get lost due to collisions. Indirect proof of that would be if every device joins when started by itself. The reset you used as a workaround would then force the devices out of their malicious sync.
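One way to probe the collision hypothesis, as a sketch only: delay each device's first join by a random amount so that boards reset together do not transmit in lockstep. `random_delay_ms` and `staggered_join` are hypothetical helpers, and the 5-second window is an arbitrary assumption; only `os.urandom` and `time.sleep` are used from the standard modules.

```python
import os
import time

def random_delay_ms(max_ms):
    """Return a pseudo-random delay in the range 0..max_ms-1 milliseconds."""
    raw = os.urandom(2)  # two random bytes -> value 0..65535
    return ((raw[0] << 8) | raw[1]) % max_ms

def staggered_join(join, max_ms=5000):
    """Sleep for a random time, then call join(); returns the delay used."""
    delay = random_delay_ms(max_ms)
    time.sleep(delay / 1000)
    join()
    return delay
```

On a LoPy4 this might be called as `staggered_join(lambda: lora.join(activation=LoRa.OTAA, auth=(app_eui, app_key), timeout=0))`. If devices still hang with staggered starts, synchronized collisions are probably not the whole story.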

Graham

It happens on different gateways, so I'm not leaning in that direction yet. I can very reliably trigger the fault by powering up 5x LoPy4's all at the same time. At least one or two of the last-to-be-powered devices will hang.

The only semi-reliable method we have for educational workshops (where 10+ devices are being powered on/off at the same time) is this:

import binascii
import machine
import utime

print("Device EUI: " + binascii.hexlify(lora.mac()).upper().decode('utf-8'))

while not lora.has_joined():
    # if no connection in a few seconds, then reboot
    if utime.time() > 15:
        print("possible timeout")
        machine.reset()

It looks and feels like a dirty workaround, though, for something that is breaking somewhere at the OS level.
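A slightly more defensive variant of the same workaround, as a sketch under assumptions: it measures elapsed time from the moment the wait starts rather than relying on the absolute value of `utime.time()`, and it takes the reset action as a parameter so the logic can be exercised off-device. `join_or_reset` and its parameters are hypothetical names; on the LoPy4 the `reset` argument would presumably be `machine.reset`.

```python
import time

def join_or_reset(lora, reset, timeout_s=15, poll_s=1):
    """Wait for the join; call reset() if it takes longer than timeout_s.

    Elapsed time is measured from when this function is called.
    Returns True if joined, False if reset() was called and returned.
    """
    start = time.time()
    while not lora.has_joined():
        if time.time() - start > timeout_s:
            print("join timed out, resetting")
            reset()
            return False
        time.sleep(poll_s)  # avoid a tight busy-wait between checks
    return True
```

Usage on the device would be `join_or_reset(lora, machine.reset)` after calling `lora.join(...)`. It is still a reboot-based workaround and does not address why the stack stops retrying.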

Anyone with 5+ LoPy4s to test could replicate the issue easily and reliably. This failure state triggers all the time, and it doesn't seem to be dependent on the gateway model.

If this happened to a device in the field, without our workaround, it would remain in a hung state indefinitely. That, erm, concerns me a little.

robert-hh

@graham I do not agree. If I look at the traffic from a node, e.g. in the TTN console or at the gateway, I see join messages every 15 seconds. Which gateway are you using?

Graham

Fyi, here's my os.uname() on the LoPy4:

sysname='LoPy4', nodename='LoPy4', release='1.18.1.r1', version='v1.8.6-849-b0520f1 on 2018-08-29', machine='LoPy4 with ESP32', lorawan='1.0.2', sigfox='1.0.1'
