UTF-8 iteration / ord() behaves unexpectedly



  • Hi, I'm trying to iterate over a string containing German umlauts (äöü) to get each character's Unicode code point as a decimal number.

    It works for single characters:

    print(ord(b'ä'))
    print(ord(bytes('ä', "utf-8")))
    

    outputs

    228
    228

    It doesn't, however, work once I iterate over the characters.

    text = "aldköäü"
    for index in range(len(text)):
        char = text[index]
        print("char:")
        print(char)
        print(char.encode("utf-8"))
        print(ord(bytes(char, "utf-8")))
    

    outputs

    char:
    a
    b'a'
    97
    char:
    l
    b'l'
    108
    char:
    d
    b'd'
    100
    char:
    k
    b'k'
    107
    char:
    ���
    b'\xf6\xe4\xfc\x00'
    Traceback (most recent call last):
      File "<stdin>", line 52, in <module>
    TypeError: ord() expected a character, but string of length 4 found

    How can I get this to work? I'm using a WiPy 3.0 and plan to use the same code on a GPy.



  • For future reference: Here's a solution.

    text = "aouäöü"
    
    print(text)
    print(len(text))
    
    for char in text:
        print(char)
        print(ord(bytes(char, "utf-8")))
    
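    As a possible alternative, here is a minimal sketch that skips the bytes() round trip and passes each character straight to ord(). It assumes a Python 3 interpreter whose str type is Unicode-aware and a source file saved as UTF-8. I have not verified this on the WiPy or GPy firmware, so it may behave differently there. The helper name code_points is made up for illustration.

```python
def code_points(text):
    """Return the decimal Unicode code point of every character in text."""
    # ord() accepts a one-character str directly. Encoding an umlaut to
    # UTF-8 first yields two bytes, which ord() would reject in standard
    # Python 3, so no bytes() conversion is done here.
    return [ord(char) for char in text]


text = "aouäöü"
for char, point in zip(text, code_points(text)):
    print(char, point)
```

    On standard Python 3 this prints 97, 111, 117, 228, 246 and 252 next to the respective characters.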
