
This is one of those "mostly asked out of pure curiosity (in the possibly futile hope that I will learn something)" questions.

I was investigating ways of saving memory when operating on massive numbers of strings, and for some scenarios it seems like the string operations in numpy could be useful. However, I got somewhat surprising results:

import random
import string
import numpy as np

milstr = [''.join(random.choices(string.ascii_letters, k=10)) for _ in range(1000000)]

npmstr = np.array(milstr, dtype=np.dtype(np.unicode_, 1000000))
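
For what it's worth, a quick sanity check of what that call actually produces (just inspecting the array; nothing here is part of the measurements below):

npmstr.dtype           # the resulting fixed-width unicode dtype, e.g. '<U10'
npmstr.dtype.itemsize  # bytes reserved per element (4 bytes per code point)
npmstr.nbytes          # total size of the array's data buffer in bytes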

Memory consumption using memory_profiler:

%memit [x.upper() for x in milstr]
peak memory: 420.96 MiB, increment: 61.02 MiB

%memit np.core.defchararray.upper(npmstr)
peak memory: 391.48 MiB, increment: 31.52 MiB

So far, so good; however, the timing results are surprising to me:

%timeit [x.upper() for x in milstr]
129 ms ± 926 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np.core.defchararray.upper(npmstr)
373 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Why is that? I expected that since numpy uses contiguous chunks of memory for its arrays, its operations are vectorized (as the above numpy doc page says), and numpy string arrays apparently use less memory (so operating on them should at least potentially be more CPU-cache-friendly), performance on arrays of strings would be at least similar to that of pure Python.

Environment:

Python 3.6.3 x64, Linux

numpy==1.14.1

  • I am generally dubious about claims of operations being "vectorised" until tested & proven. 2 examples: [`np.vectorize`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html) (little more than a `map` replacement) and [`pd.Series.str`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.html) (source code just a loopy `lambda`). – jpp Mar 05 '18 at 14:38
  • Taking a peek at the [source code](https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c) (functions `_vec_string`, `_vec_string_no_args`, ...) there doesn't seem to be vectorized magic, just looping (not sure you can vectorize string ops anyway). I don't think strings are stored contiguously in arrays, since they can have different sizes, I guess their references are. I suppose the list comprehension has less overhead than creating a NumPy array, and the string operation itself takes about the same time. – jdehesa Mar 05 '18 at 14:39
  • The time performances seem to depend on the version of Python. It's possible a version-specific optimization at play. – Cong Ma Mar 05 '18 at 15:15
  • Strings are stored in the contiguous array databuffer, with padding if needed. But numpy uses the Python string methods to process them. It does not implement its own compiled character code, which is what we usually mean by true vectorization. `defchararray` is a convenience for arrays that are already string dtype, not a substitute for processing lists of strings. From the docs, "All of them are based on the string methods in the Python standard library" – hpaulj Mar 05 '18 at 15:55
  • @hpaulj - It seems like an answer ;) – jezrael Mar 06 '18 at 06:28

1 Answer


"Vectorized" is used in two ways when talking about numpy, and it's not always clear which one is meant.

  1. Operations that operate on all elements of an array
  2. Operations that call optimized (and in many cases multi-threaded) numerical code internally

The second point is what makes vectorized operations much faster than a for loop in Python, and the multithreaded part is what makes them faster than a list comprehension. When commenters here state that vectorized code is faster, they're referring to the second case as well. However, in the numpy documentation, vectorized only refers to the first case. It means you can use a function directly on an array, without having to loop over all the elements and call it on each element. In this sense it makes code more concise, but it isn't necessarily any faster. Some vectorized operations do call multithreaded code, but as far as I know this is limited to linear algebra routines. Personally, I prefer using vectorized operations since I think it is more readable than list comprehensions, even if performance is identical.
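
As a rough illustration of the difference (a sketch, not part of the benchmarks below): both calls here are "vectorized" in the first sense, but only the numerical one dispatches to compiled loops; the string call still ends up invoking Python's str.upper for each element.

import numpy as np

nums = np.arange(1_000_000, dtype=np.float64)
strs = np.array(['abcdefghij'] * 1_000_000)

doubled = np.multiply(nums, 2)  # sense 1 and 2: a ufunc running compiled loops
uppered = np.char.upper(strs)   # sense 1 only: element-wise syntax, Python str.upper underneath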

Now, for the code in question: the documentation for np.char (which is an alias for np.core.defchararray) states:

The chararray class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of dtype object_, string_ or unicode_, and use the free functions in the numpy.char module for fast vectorized string operations.

So there are four ways (one not recommended) to handle strings in numpy. Some testing is in order, since each way will surely have different advantages and disadvantages. Using arrays defined as follows:

npob = np.array(milstr, dtype=np.object_)
npuni = np.array(milstr, dtype=np.unicode_)
npstr = np.array(milstr, dtype=np.string_)
npchar = npstr.view(np.chararray)
npcharU = npuni.view(np.chararray)

This creates arrays (or chararrays for the last two) with the following datatypes:

In [68]: npob.dtype
Out[68]: dtype('O')

In [69]: npuni.dtype
Out[69]: dtype('<U10')

In [70]: npstr.dtype
Out[70]: dtype('S10')

In [71]: npchar.dtype
Out[71]: dtype('S10')

In [72]: npcharU.dtype
Out[72]: dtype('<U10')

The benchmarks give quite a range of performance across these datatypes:

%timeit [x.upper() for x in test]
%timeit np.char.upper(test)

# test = milstr
103 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
377 ms ± 3.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npob
110 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
<error on second test, vectorized operations don't work with object arrays>

# test = npuni
295 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
323 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# test = npstr
125 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
125 ms ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npchar
663 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
127 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# test = npcharU
887 ms ± 8.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
325 ms ± 3.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Surprisingly, using a plain old list of strings is still the fastest. Numpy is competitive when the datatype is string_ or object_, but once unicode is involved performance becomes much worse. The chararray is by far the slowest, whether handling unicode or not. It should be clear why it's not recommended for use.

Using unicode strings carries a significant performance penalty. The docs state the following about the differences between these types:

For backward compatibility with Python 2 the S and a typestrings remain zero-terminated bytes and np.string_ continues to map to np.bytes_. To use actual strings in Python 3 use U or np.unicode_. For signed bytes that do not need zero-termination b or i1 can be used.
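
The storage difference alone is easy to see from the dtype itemsize (a quick illustration, not taken from the benchmarks above):

np.dtype('S10').itemsize   # 10 bytes: one byte per character
np.dtype('U10').itemsize   # 40 bytes: four bytes per code point (UCS-4)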

In this case, where the character set does not require unicode, it would make sense to use the faster string_ type. If unicode is needed, you may get better performance by using a list, or a numpy array of type object_ if other numpy functionality is needed. Another good example of when a list may be better is appending lots of data; a sketch of both points follows.
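
A minimal sketch of those two recommendations (the variable names are illustrative, not taken from the benchmarks above):

ascii_arr = np.array(['foo', 'bar', 'baz'], dtype=np.string_)  # ASCII-only data fits in the 'S' dtype
upper_arr = np.char.upper(ascii_arr)                           # stays on the faster bytes path

# Growing a collection: appending to a list is cheap and amortized,
# while np.append copies the whole array on every call.
grown = []
for chunk in (['spam'], ['ham'], ['eggs']):
    grown.extend(chunk)
final = np.array(grown, dtype=np.string_)  # convert once at the end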

So, takeaways from this:

  1. Python, while generally accepted as slow, is very performant for some common things. Numpy is generally quite fast, but is not optimized for everything.
  2. Read the docs. If there's more than one way of doing things (and there usually is), odds are one way is better for what you're trying to do.
  3. Don't blindly assume that vectorized code will be faster; always profile when you care about performance (this goes for any "optimization" tips). A bare-bones example is sketched below.
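
For instance, one way to time both approaches outside IPython, using only the standard library (a sketch; the data and sizes here are arbitrary):

import timeit

setup = "import numpy as np; data = ['abcdefghij'] * 100000; arr = np.array(data, dtype=np.string_)"
print(timeit.timeit("[x.upper() for x in data]", setup=setup, number=10))
print(timeit.timeit("np.char.upper(arr)", setup=setup, number=10))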