Numpy's str dtype and its operations are not optimized, so it's probably best to stick to the object dtype when working with strings in numpy.
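The main reason for the memory difference below is that a str array stores every element in a fixed-width buffer sized for the longest string (4 bytes per character), while an object array stores only references to ordinary Python strings. A minimal sketch (the array here is just an example):

import numpy as np

ar = np.array(['this is a string', 'string'])
ar.dtype                 # dtype('<U16') - fixed width, sized by the longest string
ar.dtype.itemsize        # 64 bytes per element (16 characters x 4 bytes each)
ar.astype(object).dtype  # dtype('O') - elements are references to Python str objects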
str consumes more memory than object
The exact ratio depends on the fixed string width and the size of the array, but as long as the longest string in the array is longer than 2 characters, str consumes more memory (the two are equal when the longest string is exactly 2 characters long). In the following example, str consumes almost 8 times more memory.
import numpy as np
from pympler.asizeof import asizeof

ar1 = np.array(['this is a string', 'string']*1000, dtype=object)
ar2 = np.array(['this is a string', 'string']*1000, dtype=str)
asizeof(ar2) / asizeof(ar1) # 7.944444444444445
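As a rough check of the break-even claim: with a fixed width of 2, each str element takes 2 * 4 = 8 bytes, the same as the 8-byte pointer each object element stores. A sketch continuing the session above (ar3 and ar4 are just new example arrays; the exact ratio depends on the Python build and on how many distinct string objects the object array references):

ar3 = np.array(['ab', 'cd']*1000, dtype=object)
ar4 = np.array(['ab', 'cd']*1000, dtype=str)
ar4.dtype.itemsize          # 8 (2 characters x 4 bytes)
asizeof(ar4) / asizeof(ar3) # roughly 1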
str is slower than object
Numpy's vectorized string methods are not optimized either, so operating on the object array is often faster. For the task in the OP, where each string is repeated, a simple * (aka multiply()) is not only more concise but also over 10 times faster than char.multiply().
import timeit
setup = "import numpy as np; from __main__ import ar1, ar2"
t1 = min(timeit.repeat("ar1*2", setup, number=1000))
t2 = min(timeit.repeat("np.char.multiply(ar2, 2)", setup, number=1000))
t2 / t1 # 10.650433758517027
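As a quick sanity check (not part of the benchmark), the two approaches produce the same strings; only the result dtype differs:

(ar1*2 == np.char.multiply(ar2, 2)).all() # True
(ar1*2).dtype                             # dtype('O')
np.char.multiply(ar2, 2).dtype            # dtype('<U32') - the fixed width doubles to fit the longest result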
Even for operations that cannot be applied directly to the array as a whole, it is faster to loop over the object array and work on the Python strings than to use the vectorized char methods of str arrays. For example, iterating over the object array and calling str.count() on each Python string is over 3 times faster than the vectorized char.count() on the str array.
f1 = lambda: np.array([s.count('i') for s in ar1])
f2 = lambda: np.char.count(ar2, 'i')
setup = "import numpy as np; from __main__ import ar1, ar2, f1, f2"
t3 = min(timeit.repeat("f1()", setup, number=1000))
t4 = min(timeit.repeat("f2()", setup, number=1000))
t4 / t3 # 3.251369161574832
On a side note, when it comes to an explicit loop, iterating over a Python list is faster than iterating over a numpy array. So in the previous example, a further performance gain can be made by converting the array to a list and iterating over that:
f3 = lambda: np.array([s.count('i') for s in ar1.tolist()])
#                                               ^^^^^^^^^ <--- convert to list here
setup = "import numpy as np; from __main__ import ar1, ar2, f1, f2, f3"
t5 = min(timeit.repeat("f3()", setup, number=1000))
t3 / t5 # 1.2623498005294627
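Combining the two ratios above, the list comprehension over the plain list ends up roughly 4 times faster than the vectorized char.count() (3.25 * 1.26):

t4 / t5 # ~4.1 on the runs above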