1

An array of Unicode characters can be used as a mutable string:

import array

ins = "Aéí"
ms = array.array('u', ins)
ms[0] = "ä"
outs = ms.tounicode()
# äéí

But type `'u'` has been deprecated since Python 3.3. What is the modern replacement?

I could do:

ms = list(ins)
# mutate
outs = ''.join(ms)

But I find a list of characters very memory-inefficient compared to the array.
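
A quick check of the per-item overhead (a small sketch; exact sizes vary by platform and Python version):

import sys

ms = list("Aéí")
# Every list item is a complete str object, roughly 50+ bytes each
# on a 64-bit CPython, versus 4 bytes per code point in an array:
print(sys.getsizeof(ms[0]))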

Alternatively:

ms = array.array('L', (ord(ch) for ch in ins))
ms[0] = ord("ä")
outs = "".join(chr(ch) for ch in ms)

But it is far less readable than the deprecated original.

VPfB
  • 14,927
  • 6
  • 41
  • 75
  • `numpy` accepts Unicode strings, allocating 4 bytes per character, a constant `n` characters per element. With all that padding it probably isn't as memory efficient as a list. – hpaulj Oct 29 '17 at 07:04
  • @hpaulj 4 bytes per character is just fine; it's like UTF-32 or `wchar_t` in the C language. But `sys.getsizeof('x')` is 50 on my PC, and `getsizeof('ä')` is 74. It takes **so many** bytes per list item, because every character in the list is actually a complete string object. – VPfB Oct 29 '17 at 07:41
  • It was deprecated because Python unicode strings are no longer a fixed size. Their width depends on the highest Unicode codepoint in the string now (1, 2 or 4 bytes per codepoint). – Martijn Pieters Nov 01 '17 at 17:34
  • Why do you need this to be an `array`? Why not use a `bytearray` object instead, and encode your data to bytes? – Martijn Pieters Nov 01 '17 at 17:34
  • @MartijnPieters I will miss the simplicity of `mutable_string[position] = ch`. Also, "deprecated" usually means it is going to be replaced by something better; I could not find out what that new thing is. – VPfB Nov 01 '17 at 17:57

3 Answers

1

This is similar to your last example, but initializes the array in a more readable and efficient way. You must choose a type code whose items are four bytes in size. `'I'` is the code for `unsigned int`, which is four bytes on most OSes. For portability you may want to choose this value programmatically.
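
For example, a minimal sketch of picking a 4-byte unsigned type code at runtime (this raises `StopIteration` if no such code exists):

import array

# 'I' (unsigned int) and 'L' (unsigned long) are the candidates;
# take the first one whose items are 4 bytes wide.
typecode = next(tc for tc in 'IL' if array.array(tc).itemsize == 4)

The example below simply hard-codes `'I'` and asserts its size: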

# coding: utf-8
import array
import sys

# Verify the item size.
assert array.array('I').itemsize == 4

# Choose the encoding based on native endianness:
encoding = 'utf-32le' if sys.byteorder == 'little' else 'utf-32be'

ins = "Aéí"
ms = array.array('I', ins.encode(encoding))
ms[0] = ord('ä')
print(ms.tobytes().decode(encoding))

Output:

äéí
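
If readability is the concern, the encode/mutate/decode pattern can be wrapped in a pair of helpers (a sketch; the names `to_mutable` and `to_str` are made up here):

import array
import sys

encoding = 'utf-32le' if sys.byteorder == 'little' else 'utf-32be'

def to_mutable(s):
    """Return the string as a mutable array of 32-bit code points."""
    return array.array('I', s.encode(encoding))

def to_str(ms):
    """Rebuild an immutable str from the array."""
    return ms.tobytes().decode(encoding)

ms = to_mutable("Aéí")
ms[0] = ord('ä')
print(to_str(ms))  # äéí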

Timings for a 1000-element string show the encode/decode round trip is quite a bit faster than converting character by character:

In [7]: s = ''.join(chr(x) for x in range(1000))

In [8]: %timeit ms = array.array('I',s.encode('utf-32le'))
1.77 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [9]: %timeit ms = array.array('I',(ord(x) for x in s))
167 µs ± 5.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit outs = "".join(chr(x) for x in ms)
194 µs ± 4.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [23]: %timeit outs = ms.tobytes().decode('utf-32le')
3.92 µs ± 97 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But you may be overthinking it. I don't know what string sizes you are dealing with, but just using `list(data)` is faster, if less memory-efficient, and it isn't that bad. Here's a string of non-BMP characters (~1M), and timings for immutable string slicing, mutating an array, and mutating a list:

In [67]: data = ''.join(chr(x) for x in range(0x10000,0x110000))

In [68]: ms = array.array('I',data.encode('utf-32le'))

In [69]: %%timeit global data
    ...: data = data[:500] + 'a' + data[501:]
    ...:
3.33 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [70]: %timeit ms[500] = ord('a')
73.6 ns ± 0.433 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [71]: %%timeit v = list(data)
    ...: v[500] = 'a'
    ...:
28.7 ns ± 0.144 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [72]: sys.getsizeof(data)
Out[72]: 4194380

In [73]: sys.getsizeof(ms)
Out[73]: 4456524

In [74]: sys.getsizeof(list(data))
Out[74]: 9437296

Mutating a list is straightforward, 3x faster than mutating the array, and only uses a little more than 2x the memory.

Mark Tolonen
  • 166,664
  • 26
  • 169
  • 251
  • Yes, this is a substantial improvement indeed. Thank you for your helpful answer. – VPfB Oct 31 '17 at 16:30
  • In Python 3.6 `array('L', s.encode('utf-32le'))` will raise a `ValueError`: bytes length not a multiple of item size. And why use `L` rather than `B` to store the `encode()` result, a byte sequence? – Jacky1205 Nov 01 '17 at 02:59
  • @Jacky it is to index the 32-bit values in the array, and the items are multiples of 4 bytes, so I don't know what you did to get that error. – Mark Tolonen Nov 01 '17 at 04:39
  • @Mark The same code as you wrote; I tried it in the https://www.python.org/ online console. By the way, which version did you use? – Jacky1205 Nov 01 '17 at 05:17
  • @Jacky Python 3.6.3 – Mark Tolonen Nov 01 '17 at 05:34
  • @MarkTolonen could you try the above code (in your post) in the https://www.python.org/ online console? I do not know why the `ValueError` would be raised. – Jacky1205 Nov 01 '17 at 05:37
  • 1
    @Jacky The python.org online console uses Linux, where `unsigned long` is 8 bytes and I am on Windows where `unsigned long` is 4 bytes. Using type `I` should work on both systems for 4 bytes unless you have a 16-bit compiler :^) – Mark Tolonen Nov 01 '17 at 05:53
  • @Jacky I should say 64-bit Linux...IIRC 32-bit linux matches Windows. – Mark Tolonen Nov 01 '17 at 05:58
  • @MarkTolonen I have found two issues: 1) the native endianness must be used to make this code usable on a different architecture. The reason is the assignment of `ord(X)` values. Simply changing `utf-32be` to `utf-32` is not good enough, because it introduces a BOM. 2) The `'L'` size is 32 bits **or more**. On my PC it's 64 bits, so I get 2 chars per item. Try to change `ms[1]` instead of `ms[0]` to see the error. – VPfB Nov 01 '17 at 08:37
  • @VPfB see the recent comments. I changed to `I` type since unsigned int is typically 4 bytes on Windows and Linux, both 32- and 64-bit. You can query the size as well to choose an appropriate type. You can query the endianness of the machine as well. – Mark Tolonen Nov 01 '17 at 08:51
  • @MarkTolonen I see the change now. Would be nice if you would note in the answer that the code is not portable because two parameters must be tuned for each system. – VPfB Nov 01 '17 at 08:59
0

The replacement in Python 3 is `str`, the built-in type. Strings are immutable sequences of Unicode code points.

As a result, there is no need to use `array('u', ...)`; `str` is enough.

By the way, the reason for removing `'u'` is that the `Py_UNICODE*` representation it uses internally is deprecated and inefficient; it should be avoided in performance- or memory-sensitive situations.
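
In other words, instead of mutating in place you build a new `str` (a sketch of the idiom):

ins = "Aéí"
# Reconstruct rather than mutate; slicing plus concatenation
# produces a new string object.
outs = "ä" + ins[1:]
# äéí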

Jacky1205
  • 3,273
  • 3
  • 22
  • 44
0

Python 3.3+ has deprecated the separate Unicode representation. Because of PEP 393, everything can be expressed and used as `str`.

Here is a nice HOWTO about Unicode handling in post-Python 3.3 interpreters. By definition:

The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:
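
For example, `ord` and `chr` convert between a character and its code point:

ch = "\u12ca"             # the HOWTO's example code point, U+12CA
print(ord(ch))            # 4810, i.e. 0x12ca
print(chr(0x12ca) == ch)  # True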

And finally, I am not sure what you want to achieve, but does the following do the job?

ins = "Aéí"
m = "ä" + ins[1:]
print(m)

On Python 3.5.2 it gives me this output:

äéí

Please let me know if this helps.


As a final note, if you are wondering about the `+` operator, here is a nice SO post about that. The gist of the top-voted answer is: if you are using CPython, you are better off using `+` than `append`/`join`, etc.
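
Applied to the single-replacement case above, both spellings give the same result (a small sketch, assuming CPython):

ins = "Aéí"
m1 = "ä" + ins[1:]            # plain + concatenation
m2 = "".join(["ä", ins[1:]])  # join over the same pieces
assert m1 == m2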


If you want to use your approach, the entire `array` module is still available with everything needed to run your code as it is. I tested it in Python 3.5 and it works.

There is also this way:

a = "ä"
p = b"".join([a.encode('utf-8'), ins[1:].encode('utf-8')])
print(p.decode('utf-8'))

SRC
  • 2,123
  • 3
  • 31
  • 44
  • I made a few tests; the list needs approx. 20x more memory and 70x more time (on my PC, 100_000 chars, YMMV). This is clearly a step back from the now-deprecated approach. – VPfB Oct 31 '17 at 10:51