29

I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.

Edit: It looks like this is more complex than expected, so I'll rephrase the question - Is the following code sufficient to generate all valid non-control characters in Unicode?

import unicodedata

unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112) # 0x10ffff + 1
    # Keep Letters, Marks, Numbers, Punctuation, Symbols and Separators
    if unicodedata.category(unichr(char))[0] in 'LMNPSZ'
    )
John Machin
l0b0
  • It might help if you were to give some more detail on "test the Unicode handling of my code" and explain what is the part that generating random UTF-8 strings has to play in that testing, and what you regard as "the entire Unicode range" (16 bits? 21 bits? non-surrogate code-points? valid chars (e.g. not U+FFFF)?). Do you trust the Python UTF-8 codec, or do you need to test that too? Python 2.X or 3.X or both? – John Machin Sep 25 '09 at 23:10
  • 1
    The goal is to accept any printable, valid Unicode code points (characters) as input for a web interface in Python 2.6. – l0b0 Sep 28 '09 at 16:05
  • 2021 updated link to the table of General Unicode Codepoint Category Values: https://www.unicode.org/reports/tr44/#GC_Values_Table – Matthew Willcockson Dec 18 '21 at 01:22

8 Answers

25

People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, edit the include_ranges list in the example so that it covers the code point ranges you want.

import random

def get_random_unicode(length):

    try:
        get_char = unichr
    except NameError:
        get_char = chr

    # Update this to include code point ranges to be sampled
    include_ranges = [
        ( 0x0021, 0x0021 ),
        ( 0x0023, 0x0026 ),
        ( 0x0028, 0x007E ),
        ( 0x00A1, 0x00AC ),
        ( 0x00AE, 0x00FF ),
        ( 0x0100, 0x017F ),
        ( 0x0180, 0x024F ),
        ( 0x2C60, 0x2C7F ),
        ( 0x16A0, 0x16F0 ),
        ( 0x0370, 0x0377 ),
        ( 0x037A, 0x037E ),
        ( 0x0384, 0x038A ),
        ( 0x038C, 0x038C ),
    ]

    alphabet = [
        get_char(code_point) for current_range in include_ranges
            for code_point in range(current_range[0], current_range[1] + 1)
    ]
    return ''.join(random.choice(alphabet) for i in range(length))

if __name__ == '__main__':
    print('A random string: ' + get_random_unicode(10))
Jacob Wan
10

There is a UTF-8 stress test from Markus Kuhn you could use.

See also Really Good, Bad UTF-8 example test data.

Community
Gumbo
9

Here is an example function that probably creates a random well-formed UTF-8 sequence, as defined in Table 3–7 of Unicode 5.0.0:

#!/usr/bin/env python3.1

# From Table 3–7 of the Unicode Standard 5.0.0

import random

def byte_range(first, last):
    return list(range(first, last+1))

first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
trailing_values = byte_range(0x80, 0xBF)

def random_utf8_seq():
    first = random.choice(first_values)
    if first <= 0x7F:
        return bytes([first])
    elif first <= 0xDF:
        return bytes([first, random.choice(trailing_values)])
    elif first == 0xE0:
        return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
    elif first == 0xED:
        return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
    elif first <= 0xEF:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF0:
        return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
    elif first <= 0xF3:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF4:
        return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])

print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))

Because the Unicode standard is so large, I cannot test this thoroughly. Also note that the resulting characters are not uniformly distributed, even though each byte is chosen uniformly from its allowed range.
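To make the remark about the distribution concrete, here is a small sketch that counts how many of the possible lead bytes start a sequence of each length. It rebuilds the same first_values list as above; the helper name seq_length is my own and not part of the original function.

```python
from fractions import Fraction

# Same lead bytes as first_values in the generator above
first_values = list(range(0x00, 0x80)) + list(range(0xC2, 0xF5))

def seq_length(first):
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if first <= 0x7F:
        return 1
    if first <= 0xDF:
        return 2
    if first <= 0xEF:
        return 3
    return 4

# Count the lead bytes per sequence length
counts = {}
for lead in first_values:
    n = seq_length(lead)
    counts[n] = counts.get(n, 0) + 1

# A uniform choice of lead byte gives these shares per sequence length
shares = {n: Fraction(c, len(first_values)) for n, c in counts.items()}

for n in sorted(shares):
    print(n, "byte(s):", counts[n], "lead bytes, share", float(shares[n]))
```

Since 128 of the 179 lead bytes are ASCII, most generated sequences are single ASCII characters, and supplementary-plane characters are rare. If that matters for your test, you could pick the sequence length first and the lead byte second.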

Jonathan Leffler
Philipp
7

The following code prints every printable Unicode character (Python 3):

print(''.join(chr(i) for i in range(32, 0x110000) if chr(i).isprintable()))

All printable characters are included above, even those that the current font cannot render. The clause `and not chr(i).isspace()` can be added to filter out whitespace characters.
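To get a random string rather than the full list, one option is to sample from the printable set with random.choices (available from Python 3.6). This is only a sketch: the names printable and random_printable, and the optional rng parameter, are mine.

```python
import random
import sys

# Every code point that str.isprintable() accepts, built once up front
printable = [
    chr(i) for i in range(32, sys.maxunicode + 1) if chr(i).isprintable()
]

def random_printable(length, rng=random):
    """Return `length` printable characters, sampled with replacement."""
    return "".join(rng.choices(printable, k=length))

if __name__ == "__main__":
    print(random_printable(20))
```

Passing a seeded random.Random instance as rng makes a failing test reproducible.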

Asclepius
aluriak
  • That's not actually going to give you a random string, although of course you could just use [`random.sample`](https://docs.python.org/2/library/random.html#random.sample) instead of `print`. – l0b0 Sep 25 '16 at 08:32
  • So use `random.choices` instead. – gimboland Jan 30 '18 at 13:20
3

It depends how thoroughly you want to do the testing and how accurately you want to do the generation. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for custom characters. Do you want to worry about generating combining characters at the start of a string (because they should only appear after another character)?

The basic approach I'd adopt is randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character; can it appear here in the string) and encode valid code points in UTF-8.
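A minimal Python 3 sketch of that generate, validate, encode loop follows. The function name, the rng parameter and the particular rejection rules are assumptions for illustration. It skips unassigned, surrogate, private-use and control code points, and refuses a combining mark at the start of the string. Adjust the rules to whatever your code is actually meant to accept.

```python
import random
import unicodedata

# General categories rejected in this sketch (an assumption, not a standard):
# Cn = unassigned/noncharacter, Cs = surrogate, Co = private use, Cc = control
REJECTED = frozenset({"Cn", "Cs", "Co", "Cc"})

def random_code_point_string(length, rng=None):
    """Build a string by rejection sampling over U+0000..U+10FFFF."""
    if rng is None:
        rng = random.Random()
    chars = []
    while len(chars) < length:
        ch = chr(rng.randint(0, 0x10FFFF))
        category = unicodedata.category(ch)
        if category in REJECTED:
            continue  # not a character we want to emit
        if not chars and category.startswith("M"):
            continue  # a combining mark needs a preceding base character
        chars.append(ch)
    return "".join(chars)

if __name__ == "__main__":
    text = random_code_point_string(10)
    print(text.encode("utf-8"))
```

What counts as assigned depends on the Unicode version shipped with your Python's unicodedata module, so results can differ between interpreter versions.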

If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes.
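One such simpler scheme, sketched here under my own assumptions, is to generate arbitrary random bytes and use Python 3's strict UTF-8 decoder as the reference for whether each input is well formed. The helper names are hypothetical, and this only helps if you trust the built-in codec.

```python
import random

def random_bytes(length, rng=None):
    """Return `length` uniformly random bytes; most will not be valid UTF-8."""
    if rng is None:
        rng = random.Random()
    return bytes(rng.randrange(256) for _ in range(length))

def is_well_formed_utf8(data):
    """Use the strict built-in decoder as the oracle for well-formedness."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

if __name__ == "__main__":
    sample = random_bytes(8)
    print(sample, is_well_formed_utf8(sample))
```

Your code under test should then accept exactly the inputs for which is_well_formed_utf8 returns True and reject the rest.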

Note that you need to know what to expect given the input - otherwise you are not testing; you are experimenting.

Jonathan Leffler
0

Since Unicode is just a range of - well - codes, what about using unichr() to get the unicode string corresponding to a random number between 0 and 0xFFFF?
(Of course that would give just one codepoint, so iterate as required)
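As a rough Python 3 sketch of this idea (chr replaces unichr there), the version below picks random code points from the 16-bit range but skips the surrogate block, because a lone surrogate cannot be encoded as UTF-8. The helper names are my own. It can still return unassigned code points and noncharacters such as U+FFFF.

```python
import random

# U+D800..U+DFFF are surrogates and must not appear as single code points
SURROGATES = range(0xD800, 0xE000)

def random_bmp_char(rng=None):
    """Return one random non-surrogate character from U+0000..U+FFFF."""
    if rng is None:
        rng = random.Random()
    while True:
        code_point = rng.randint(0, 0xFFFF)
        if code_point not in SURROGATES:
            return chr(code_point)

def random_bmp_string(length, rng=None):
    """Iterate random_bmp_char to build a string of the given length."""
    if rng is None:
        rng = random.Random()
    return "".join(random_bmp_char(rng) for _ in range(length))

if __name__ == "__main__":
    # Why the surrogates are skipped: encoding one on its own fails
    try:
        chr(0xD800).encode("utf-8")
    except UnicodeEncodeError as exc:
        print("lone surrogate rejected:", exc.reason)
    print(random_bmp_string(10).encode("utf-8"))
```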

Joril
  • 3
    Unfortunately, it's not so simple. Unicode contains many more than 0x10000 code points, and the range is not contiguous. For example, the surrogate values must never appear as single code points. So the question of what forms a valid UTF-8 string is highly nontrivial. The details are described in definition D92 of Chapter 3 of the Unicode Standard. There is also a table (3–7) that lists all valid possibilities for UTF-8 byte sequences. – Philipp Sep 25 '09 at 13:54
  • Unicode runs from U+0000 to U+10FFFF; there are also numerous code points that are not valid, including (as it happens) U+FFFF. The Unicode standard says of it " - the value FFFF is guaranteed not to be a Unicode character at all". – Jonathan Leffler Sep 25 '09 at 13:58
0

You could download a website written in Greek or German that uses Unicode and feed that to your code.

Esteban Küber
0

Answering revised question:

Yes, on a strict definition of "control characters" -- note that you won't include CR, LF, and TAB; is that what you want?
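To see the effect on CR, LF and TAB, here is a quick Python 3 check of the same category filter. The helper name is_selected is mine. It shows that those three characters have category Cc and are dropped, while an ordinary space (Zs) is kept.

```python
import unicodedata

def is_selected(ch):
    """Same test as in the question: major category in L, M, N, P, S or Z."""
    return unicodedata.category(ch)[0] in "LMNPSZ"

if __name__ == "__main__":
    for ch in ["\r", "\n", "\t", " ", "a"]:
        print(repr(ch), unicodedata.category(ch), is_selected(ch))
```

If you do want line breaks and tabs in your test input, add them to the result explicitly.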

Please consider responding to my earlier invitation to tell us what you are really trying to do.

John Machin