Generate random UTF-8 string in Python

Question

I'd like to test the Unicode handling of my code. Is there anything I can put in random.choice() to select from the entire Unicode range, preferably not an external module? Neither Google nor StackOverflow seems to have an answer.

Edit: It looks like this is more complex than expected, so I'll rephrase the question - Is the following code sufficient to generate all valid non-control characters in Unicode?

import unicodedata

unicode_glyphs = ''.join(
    unichr(char)
    for char in xrange(1114112) # 0x10ffff + 1
    if unicodedata.category(unichr(char))[0] in 'LMNPSZ'
    )

It might help if you were to give some more detail on "test the Unicode handling of my code" and explain what is the part that generating random UTF-8 strings has to play in that testing, and what you regard as "the entire Unicode range" (16 bits? 21 bits? non-surrogate code-points? valid chars (e.g. not U+FFFF)?). Do you trust the Python UTF-8 codec, or do you need to test that too? Python 2.X or 3.X or both? — John Machin, Sep 25 '09 at 23:10
The goal is to accept any printable, valid Unicode code points (characters) as input for a web interface in Python 2.6. — l0b0, Sep 28 '09 at 16:05
2021 updated link to the table of General Unicode Codepoint Category Values: https://www.unicode.org/reports/tr44/#GC_Values_Table — Matthew Willcockson, Dec 18 '21 at 01:22

Jacob Wan · Answer 1 · 2015-07-23T21:37:23.177

People may find their way here based mainly on the question title, so here's a way to generate a random string containing a variety of Unicode characters. To include more (or fewer) possible characters, extend the include_ranges list in the example with the code point ranges that you want.

import random

def get_random_unicode(length):

    try:
        get_char = unichr
    except NameError:
        get_char = chr

    # Update this to include code point ranges to be sampled
    include_ranges = [
        ( 0x0021, 0x0021 ),
        ( 0x0023, 0x0026 ),
        ( 0x0028, 0x007E ),
        ( 0x00A1, 0x00AC ),
        ( 0x00AE, 0x00FF ),
        ( 0x0100, 0x017F ),
        ( 0x0180, 0x024F ),
        ( 0x2C60, 0x2C7F ),
        ( 0x16A0, 0x16F0 ),
        ( 0x0370, 0x0377 ),
        ( 0x037A, 0x037E ),
        ( 0x0384, 0x038A ),
        ( 0x038C, 0x038C ),
    ]

    alphabet = [
        get_char(code_point) for current_range in include_ranges
            for code_point in range(current_range[0], current_range[1] + 1)
    ]
    return ''.join(random.choice(alphabet) for i in range(length))

if __name__ == '__main__':
    print('A random string: ' + get_random_unicode(10))

Thank you, Jacob. Would there be any issues in running this code in Python 2.7? — morfys, Jul 22 '15 at 21:49
@morfys It didn't, but I just edited it so it does. Thanks for asking. — Jacob Wan, Jul 23 '15 at 21:32

score 10 · Accepted Answer · edited May 23 '17 at 12:34

There is a UTF-8 stress test from Markus Kuhn you could use.

See also Really Good, Bad UTF-8 example test data.

answered Sep 25 '09 at 13:47 · Gumbo
That would be useful to ensure that the program doesn't break when given incorrect text, but it wouldn't help as a conformance test. – Esteban Küber Sep 25 '09 at 13:49
+1. l0b0: don't worry about generating random unicode. Borrowing someone else's wheel > reinventing it. – Matt Ball Sep 25 '09 at 13:53
Good answer, but doesn't actually answer the question as asked. – Kylotan Nov 24 '12 at 21:08
Download of the file is blocked on Mac Chrome 54.0.2840.59, but you can still view it by clicking on it. – Cat Zimmermann Oct 15 '16 at 20:35

score 9 · Answer 3 · edited Oct 02 '09 at 04:09

Here is an example function that probably creates a random well-formed UTF-8 sequence, as defined in Table 3–7 of Unicode 5.0.0:

#!/usr/bin/env python3.1

# From Table 3–7 of the Unicode Standard 5.0.0

import random

def byte_range(first, last):
    return list(range(first, last+1))

first_values = byte_range(0x00, 0x7F) + byte_range(0xC2, 0xF4)
trailing_values = byte_range(0x80, 0xBF)

def random_utf8_seq():
    first = random.choice(first_values)
    if first <= 0x7F:
        return bytes([first])
    elif first <= 0xDF:
        return bytes([first, random.choice(trailing_values)])
    elif first == 0xE0:
        return bytes([first, random.choice(byte_range(0xA0, 0xBF)), random.choice(trailing_values)])
    elif first == 0xED:
        return bytes([first, random.choice(byte_range(0x80, 0x9F)), random.choice(trailing_values)])
    elif first <= 0xEF:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF0:
        return bytes([first, random.choice(byte_range(0x90, 0xBF)), random.choice(trailing_values), random.choice(trailing_values)])
    elif first <= 0xF3:
        return bytes([first, random.choice(trailing_values), random.choice(trailing_values), random.choice(trailing_values)])
    elif first == 0xF4:
        return bytes([first, random.choice(byte_range(0x80, 0x8F)), random.choice(trailing_values), random.choice(trailing_values)])

print("".join(str(random_utf8_seq(), "utf8") for i in range(10)))

Because of the vastness of the Unicode standard I cannot test this thoroughly. Also note that the characters are not equally distributed (but each byte in the sequence is).
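
If an even distribution over characters matters more than an even distribution over bytes, one alternative is to pick the code point first and let Python's codec do the encoding. The following is a minimal sketch rather than part of the answer above; the names random_scalar_value and random_utf8_uniform are made up for illustration, and it assumes Python 3 and that the built-in UTF-8 codec is trusted. It still produces unassigned code points and noncharacters, since those are well-formed UTF-8 too.

```python
import random

SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF
SURROGATE_COUNT = SURROGATE_LAST - SURROGATE_FIRST + 1  # 2048 code points


def random_scalar_value(rng=random):
    """Pick a Unicode scalar value uniformly (any code point except surrogates)."""
    # 0x110000 code points in total, minus the surrogate block.
    n = rng.randrange(0x110000 - SURROGATE_COUNT)
    # Values at or above the surrogate block are shifted past it.
    return n + SURROGATE_COUNT if n >= SURROGATE_FIRST else n


def random_utf8_uniform(length, rng=random):
    """Return UTF-8 bytes encoding `length` uniformly chosen scalar values."""
    text = "".join(chr(random_scalar_value(rng)) for _ in range(length))
    return text.encode("utf-8")


if __name__ == "__main__":
    # A seeded generator makes a failing test case reproducible.
    print(random_utf8_uniform(10, random.Random(42)))
```

Passing a seeded random.Random instance as rng is optional; by default the module-level generator is used.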

score 7 · Answer 4 · edited May 31 '19 at 19:57

The following code prints every printable Unicode character:

print(''.join(tuple(chr(i) for i in range(32, 0x110000) if chr(i).isprintable())))

All printable characters are included above, even those that are not printed by the current font. The clause and not chr(i).isspace() can be added to filter out whitespace characters.

answered Sep 25 '16 at 00:57 · aluriak · edited by Asclepius

That's not actually going to give you a random string, although of course you could just use [`random.sample`](https://docs.python.org/2/library/random.html#random.sample) instead of `print`. – l0b0 Sep 25 '16 at 08:32
So use `random.choices` instead. – gimboland Jan 30 '18 at 13:20
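
Combining the answer with the `random.choices` suggestion from the comments gives something like the sketch below. It assumes Python 3.6 or later (where `random.choices` exists); PRINTABLE and random_printable are illustrative names, not part of the original answer.

```python
import random

# Build the pool once; scanning all 0x110000 code points takes a moment.
PRINTABLE = [chr(i) for i in range(32, 0x110000) if chr(i).isprintable()]


def random_printable(length, rng=random):
    """Return a string of `length` printable characters, sampled with replacement."""
    return "".join(rng.choices(PRINTABLE, k=length))


if __name__ == "__main__":
    print(random_printable(20))
```

Which characters count as printable depends on the Unicode database shipped with the Python version in use, so the pool size differs between releases.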

Jonathan Leffler · Answer 5 · 2009-09-25T14:01:02.740

It depends how thoroughly you want to do the testing and how accurately you want to do the generation. In full, Unicode is a 21-bit code set (U+0000 .. U+10FFFF). However, some quite large chunks of that range are set aside for custom characters. Do you want to worry about generating combining characters at the start of a string (because they should only appear after another character)?

The basic approach I'd adopt is to randomly generate a Unicode code point (say U+2397 or U+31232), validate it in context (is it a legitimate character; can it appear here in the string) and encode valid code points in UTF-8.
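
A rough sketch of that generate, validate, encode loop in Python 3 is shown here. The validation rules are assumptions chosen for illustration: anything in a general category starting with "C" (control, format, surrogate, private use, unassigned) is rejected, and combining marks ("M" categories) are rejected in the first position. The helper names is_acceptable and random_valid_utf8 are hypothetical, and a real test may need stricter or looser rules.

```python
import random
import unicodedata


def is_acceptable(ch, at_start):
    """Decide whether `ch` may be appended, given its position in the string."""
    category = unicodedata.category(ch)
    # Cc, Cf, Cs, Co, Cn: not treated as legitimate characters in this sketch.
    if category.startswith("C"):
        return False
    # Mn, Mc, Me: combining marks should follow another character.
    if at_start and category.startswith("M"):
        return False
    return True


def random_valid_utf8(length, rng=random):
    """Return UTF-8 bytes for `length` randomly chosen, validated code points."""
    chars = []
    while len(chars) < length:
        # Draw from the full 21-bit range and discard what fails validation.
        candidate = chr(rng.randrange(0x110000))
        if is_acceptable(candidate, at_start=not chars):
            chars.append(candidate)
    return "".join(chars).encode("utf-8")


if __name__ == "__main__":
    print(random_valid_utf8(10).decode("utf-8"))
```

Most of the code space is unassigned, so the loop discards the majority of candidates; for short test strings that cost is negligible.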

If you just want to check whether your code handles malformed UTF-8 correctly, you can use much simpler generation schemes.

Note that you need to know what to expect given the input - otherwise you are not testing; you are experimenting.
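
For the malformed-input case, one simple scheme is to feed random bytes to the decoder while stating the expected outcome up front. The sketch below assumes Python 3 and its built-in UTF-8 codec as the thing under test; random_bytes and check_decoder are illustrative names. The expectation is that strict decoding either round-trips exactly or raises UnicodeDecodeError, and that lenient decoding of bad input yields at least one U+FFFD.

```python
import random


def random_bytes(length, rng=random):
    """Return `length` arbitrary bytes, most of which will not be valid UTF-8."""
    return bytes(rng.randrange(256) for _ in range(length))


def check_decoder(data):
    """Return the decoded text, or None if `data` is not well-formed UTF-8."""
    try:
        text = data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        # Expected for malformed input: lenient decoding substitutes U+FFFD.
        replaced = data.decode("utf-8", errors="replace")
        assert "\ufffd" in replaced
        return None
    # Expected for well-formed input: re-encoding gives back the same bytes.
    assert text.encode("utf-8") == data
    return text


if __name__ == "__main__":
    rng = random.Random(7)
    results = [check_decoder(random_bytes(8, rng)) for _ in range(1000)]
    print(sum(r is None for r in results), "of 1000 inputs were rejected")
```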

score 0 · Answer 6 · answered Sep 25 '09 at 13:44

Since Unicode is just a range of - well - codes, what about using unichr() to get the unicode string corresponding to a random number between 0 and 0xFFFF?
(Of course that would give just one codepoint, so iterate as required)

answered Sep 25 '09 at 13:44 · Joril

Unfortunately, it's not so simple. Unicode contains many more than 0x10000 characters, and the range of valid code points is not contiguous. For example, the surrogate values must never appear as single code points. So the question of what forms a valid UTF-8 string is highly nontrivial. The details are described in definition D92 of Chapter 3 of the Unicode Standard. There is also a table (3–7) that lists all valid possibilities for UTF-8 byte sequences. – Philipp Sep 25 '09 at 13:54
Unicode runs from U+0000 to U+10FFFF; there are also numerous code points that are not valid, including (as it happens) U+FFFF. The Unicode standard says of it " - the value FFFF is guaranteed not to be a Unicode character at all". – Jonathan Leffler Sep 25 '09 at 13:58
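
Both points in these comments can be checked from a Python 3 prompt. The snippet below is only a sketch of how CPython's codec and unicodedata module treat the two cases; utf8_encodable is a hypothetical helper name.

```python
import unicodedata


def utf8_encodable(code_point):
    """Return True if Python's strict UTF-8 codec accepts this code point."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # A lone surrogate has category Cs and is refused by the codec.
    print(unicodedata.category(chr(0xD800)), utf8_encodable(0xD800))
    # U+FFFF is a noncharacter (category Cn), yet it still encodes.
    print(unicodedata.category(chr(0xFFFF)), utf8_encodable(0xFFFF))
```

So filtering by encodability alone removes surrogates but not noncharacters such as U+FFFF; those need a category check or an explicit exclusion.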

score 0 · Answer 7 · answered Sep 25 '09 at 13:45

You could download a website written in Greek or German that uses Unicode and feed that to your code.

answered Sep 25 '09 at 13:45 · Esteban Küber

score 0 · Answer 8 · answered Sep 28 '09 at 14:47

Answering revised question:

Yes, on a strict definition of "control characters" -- note that you won't include CR, LF, and TAB; is that what you want?
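
The point about CR, LF, and TAB is easy to confirm. This is a small Python 3 sketch applying the same category test as the question's filter; kept_by_filter is an illustrative name.

```python
import unicodedata


def kept_by_filter(ch):
    """Apply the question's test: first letter of the category in LMNPSZ."""
    return unicodedata.category(ch)[0] in "LMNPSZ"


if __name__ == "__main__":
    for name, ch in [("TAB", "\t"), ("LF", "\n"), ("CR", "\r"), ("SPACE", " ")]:
        print(name, unicodedata.category(ch), kept_by_filter(ch))
```

TAB, LF, and CR all have category Cc and are dropped, while the ordinary space (Zs) is kept. If line breaks and tabs are valid input for the web interface, they have to be added back explicitly.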

Please consider responding to my earlier invitation to tell us what you are really trying to do.

answered Sep 28 '09 at 14:47 · John Machin
