Iterating over Unicode Characters

Question

I wanted to loop over Unicode characters in Python like this:

hex_list = "0123456789abcdef"
for _1 in hex_list:
    for _2 in hex_list:
        for _3 in hex_list:
            for _4 in hex_list:
                my_char = r"\u" + _1 + _2 + _3 + _4
                print(my_char)

As expected this printed out:

\u0000
\u0001
...
\uffff

Then I tried to change the code above to print the corresponding characters instead of the escape sequences:

hex_list = "0123456789abcdef"
for _1 in hex_list:
    for _2 in hex_list:
        for _3 in hex_list:
            for _4 in hex_list:
                my_char = r"\u" + _1 + _2 + _3 + _4
                eval("print(my_char)")

But this outputs the same as the code before. Something like this:

hex_list = "0123456789abcdef"
for _1 in hex_list:
    for _2 in hex_list:
        for _3 in hex_list:
            for _4 in hex_list:
                eval("print(" + r"\u" + f"{_1}{_2}{_3}{_4})")

raises the following error message:

eval("print(" + r"\u" + f"{_1}{_2}{_3}{_4})")
  File "<string>", line 1
    print(\u0000)
                ^
SyntaxError: unexpected character after line continuation character

What would make this code work as intended?

Fiddling with `eval`ing string literals smells like an [XY problem](https://meta.stackexchange.com/q/66377/478746). Why not use `chr(codepoint)`? – Brian61354270 Feb 21 '23 at 15:25
@Brian To be clear, `codepoint` needs to be an int, which can be got with `int(f"{_1}{_2}{_3}{_4}", 16)` – wjandrea Feb 21 '23 at 15:27
Python strings are Unicode. All characters are Unicode characters. Unicode isn't some kind of escape sequence, it's a way of mapping characters to code points, which encodings such as UTF-8 then map to bytes. – Panagiotis Kanavos Feb 21 '23 at 15:27
Also, note that `eval("print(my_char)")` is the same as `print(my_char)`; it's just printing the string contents of the variable `my_char`. – Brian61354270 Feb 21 '23 at 15:27
Why are you using nested loops in the first place when you could just be looping over numbers? `for codepoint in range(0x10000): ...`. Or you could at least use [`product`](https://docs.python.org/3/library/itertools.html#itertools.product) instead of a nested loop. – wjandrea Feb 21 '23 at 15:28
The error is telling you that the *escape sequence* you constructed is invalid. It says nothing about the NUL character you tried to create. – Panagiotis Kanavos Feb 21 '23 at 15:28
Given the *fact* that Python strings are Unicode, you can use [chr](https://docs.python.org/3/library/functions.html#chr) to convert a Unicode code point to a string with that character, e.g. `print(chr(1081))`. You can iterate from `0` to whatever number you want to generate characters. – Panagiotis Kanavos Feb 21 '23 at 15:31
"Mandatory" background reading: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) – Brian61354270 Feb 21 '23 at 15:35
Why are you expecting `\u0000` to work? Strings need to be quoted, i.e. `'\u0000'`. Did you just forget to add the quote marks? `eval(fr"print('\u{_1}{_2}{_3}{_4}')")` – wjandrea Feb 21 '23 at 15:35
You aren't iterating over Unicode characters in the original code. You are iterating over regular ASCII characters and constructing strings that look like escape sequences used to indicate Unicode characters in string literals. Two *very* different things. – chepner Feb 21 '23 at 15:36
Does this answer your question? [Process escape sequences in a string in Python](https://stackoverflow.com/questions/4020539/process-escape-sequences-in-a-string-in-python) – Abdul Aziz Barkat Feb 21 '23 at 15:55
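
Several of the comments above come down to one distinction: `r"\u0041"` is a six-character string, while `"\u0041"` is a string literal that the parser resolves to a single character. A minimal sketch of that difference, with the `chr`-based conversion and the quoted `eval` variant suggested in the comments (the variable names here are illustrative only):

```python
raw = r"\u0041"      # six characters: backslash, 'u', '0', '0', '4', '1'
literal = "\u0041"   # one character: the escape is resolved when the source is parsed

print(len(raw), len(literal))

# Parse the four hex digits as an int, then convert the code point to a character.
from_hex = chr(int(raw[2:], 16))
print(from_hex == literal)

# Wrapping the text in quote marks makes it a valid string literal for eval,
# though chr() is simpler and avoids eval entirely.
from_eval = eval("'" + raw + "'")
print(from_eval == literal)
```

Printing `raw` shows the backslash sequence itself, which is why the loops in the question print `\u0000` and so on rather than characters.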

score -1 · Accepted Answer · edited Feb 21 '23 at 15:48

Python strings are Unicode already. Unicode isn't some kind of escape sequence, it's a standard that maps characters to numeric code points; encodings such as UTF-8 then map those code points to bytes.

Given that fact, you can use `chr` to convert a Unicode code point to a string with that character, e.g. `print(chr(1081))`. As the function's docs say:

Return the string representing a character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. This is the inverse of ord().

The valid range for the argument is from 0 through 1,114,111
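
As a quick illustration of the quoted documentation, the following sketch checks the `chr`/`ord` round trip and the `ValueError` that `chr` raises outside the valid range. `is_valid_code_point` is a hypothetical helper written for this example, not part of the standard library.

```python
def is_valid_code_point(i: int) -> bool:
    """Return True if chr(i) succeeds (hypothetical helper for illustration)."""
    try:
        chr(i)
    except ValueError:
        return False
    return True

# The two examples from the docs: chr(97) is 'a' and chr(8364) is the euro sign.
print(chr(97) == "a", chr(8364) == "\u20ac")

# chr and ord are inverses of each other.
print(ord(chr(1081)) == 1081)

# 0x10FFFF (1,114,111) is the last valid argument; 0x110000 is out of range.
print(is_valid_code_point(0x10FFFF), is_valid_code_point(0x110000))
```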

A simple loop can generate all valid characters. Actually printing them is another matter:

for i in range(0, 1114112):
    print(chr(i))

Running this on my machine eventually fails with:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

That value, the lone surrogate U+D800, can't be encoded as UTF-8, which my terminal uses, so it can't be printed.
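
One way around this, sketched here on the assumption that skipping the surrogate range U+D800 to U+DFFF is acceptable, is to filter those code points out before printing. The generator name `non_surrogate_chars` is made up for this example.

```python
import unicodedata

def non_surrogate_chars(stop=0x110000):
    """Yield every character below `stop` except surrogates (category 'Cs')."""
    for i in range(stop):
        ch = chr(i)
        if unicodedata.category(ch) != "Cs":
            yield ch

# Count rather than print, since many terminals cannot render every character.
total = sum(1 for _ in non_surrogate_chars())
print(total)

# Everything that remains can be encoded as UTF-8 without an error.
encoded = "".join(non_surrogate_chars(0x10000)).encode("utf-8")
print(len(encoded) > 0)
```

The count should come out as 0x110000 minus the 2,048 surrogate code points. Whether the remaining characters display sensibly still depends on the terminal and its fonts.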

edited Feb 21 '23 at 15:48 by wjandrea

answered Feb 21 '23 at 15:35 by Panagiotis Kanavos

It's easier with hex: `range(0x110000)` – wjandrea Feb 21 '23 at 15:42
I used the value from the documentation. – Panagiotis Kanavos Feb 21 '23 at 15:44
The docs also have the hex version: *"1,114,111 (0x10FFFF in base 16)."* – wjandrea Feb 21 '23 at 15:46
D800-DFFF are reserved for UTF-16 surrogates – Mark Tolonen Feb 21 '23 at 16:06
*A simple loop can generate **all valid characters**.* Disagree if you are looping over `range(0x110000)`. What about [noncharacters](https://www.unicode.org/faq/private_use.html#noncharacters)? – JosefZ Feb 21 '23 at 17:23
@JosefZ take it up with whoever edits the Python docs I suppose? There's a `Report a Bug` link in the docs page. The doc says `The valid range for the argument is from 0 through 1,114,111` – Panagiotis Kanavos Feb 22 '23 at 07:51
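
To make the noncharacter point concrete: the Unicode standard designates 66 code points as noncharacters, namely U+FDD0 through U+FDEF plus the last two code points of each of the 17 planes. `chr` accepts them like any other code point. A rough sketch of a filter, where `is_noncharacter` is a hypothetical helper and not a standard-library function:

```python
def is_noncharacter(i: int) -> bool:
    """True for the 66 Unicode noncharacters (hypothetical helper)."""
    # U+FDD0..U+FDEF, plus any code point whose low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= i <= 0xFDEF or (i & 0xFFFE) == 0xFFFE

noncharacters = [i for i in range(0x110000) if is_noncharacter(i)]
print(len(noncharacters))
print(hex(noncharacters[0]), hex(noncharacters[-1]))
```

A loop over `range(0x110000)` could skip these with `if not is_noncharacter(i)` if only assigned-or-assignable characters are wanted.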

score -1 · Matt Pitkin · Answer 2 · 2023-02-21T15:44:49.070

I'd recommend using itertools in this case, and then bytearray.fromhex (as shown here), decoding each two-byte value as big-endian UTF-16, e.g.,

from itertools import product

for comb in product("0123456789abcdef", repeat=4):
    # Four hex digits -> two bytes -> one UTF-16 code unit
    print(bytearray.fromhex("".join(comb)).decode("utf-16-be"))

although this raises a UnicodeDecodeError once it reaches the surrogate range, much like the error in @Panagiotis's answer. To get round the error you can use a try... except... block, e.g.:

for comb in product("0123456789abcdef", repeat=4):
    try:
        print(bytearray.fromhex("".join(comb)).decode("utf-16-be"))
    except UnicodeDecodeError:
        # Lone surrogates (d800 to dfff) cannot be decoded; skip them
        pass
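
If the goal is simply to turn each four-digit hex string into its character, a sketch that keeps `product` but goes through `int` and `chr` instead of decoding bytes might look like this. It assumes the full digit string `0123456789abcdef` and collects the results in a list (`chars`, an illustrative name) rather than printing them. Lone surrogates are kept because `chr` accepts them, so printing the list may still hit the encoding error described in the other answer.

```python
from itertools import product

HEX_DIGITS = "0123456789abcdef"

chars = [
    chr(int("".join(digits), 16))   # e.g. ('0', '0', '4', '1') -> 0x41 -> 'A'
    for digits in product(HEX_DIGITS, repeat=4)
]

print(len(chars))
print(chars[0x41])
```

Because `product` yields the digit tuples in lexicographic order, the list index of each character equals its code point, which is the same result as `[chr(i) for i in range(0x10000)]`.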

edited Feb 21 '23 at 15:44

answered Feb 21 '23 at 15:35 by Matt Pitkin

While this is more concise than the OP's code, it still produces the same wrong output – Brian61354270 Feb 21 '23 at 15:36
Thanks, I've (hopefully) fixed it – Matt Pitkin Feb 21 '23 at 15:45
