I came across this in the context of drawing random samples from a sequence of characters. Let me know if it belongs somewhere else.

If I generate a list of 1000 characters:

import random

my_list = random.choices('abc', k=1000)

and a numpy array of the same 1000 characters:

import numpy as np

my_array = np.array(my_list)

and then join each into a single long string and time it with %timeit, I get:

''.join(my_list)  # vanilla Python list
# 7.69 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

''.join(my_array)  # numpy array
# 257 µs ± 3.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Why the huge discrepancy in run time?

My thoughts

My first thought was that str.join converts the numpy array into a list before joining it. No such luck: if that conversion were the bottleneck, its cost would have to account for the gap between the two operations, but timing it on its own doesn't come anywhere close. ndarray.tolist() is very fast!

my_array.tolist()
# 11.6 µs ± 87 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So I suppose str.join is just highly optimized for Python lists and not for numpy arrays? This seems odd, since numpy arrays and operations are usually heavily optimized and often much faster than equivalent vanilla Python implementations (see the aside below).
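One way to see this is to convert explicitly and then join. Given the timings above, this should cost roughly 12 µs + 8 µs, far below the 257 µs of joining the array directly, so whatever str.join is doing with the array, it clearly isn't going through tolist():

''.join(my_array.tolist())
# should take roughly the tolist() time plus the plain-list join time,
# nowhere near the 257 µs measured when joining the array directly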

I'm not sure where to find the actual implementation of str.join, so I have been unable to use the source (Luke).

Aside: generating the random characters with numpy

If I generate my random characters with numpy:

my_array = np.random.choice([char for char in 'abc'], size=1000, replace=True)
# 29.8 µs ± 2.33 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

and compare its run time with random.choices, numpy comes out an order of magnitude faster, and other options from the random module seem slower still (more details here):

my_list = random.choices('abc', k=1000)
# 276 µs ± 6.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This means that if I am generating and joining my characters in one shot for my application, the two approaches work out to about the same speed (roughly 276 + 8 µs for the list path versus 30 + 257 µs for the numpy path). But it got me wondering why numpy fails to maintain its lead as soon as I bring the join operation into the mix.
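To make the end-to-end comparison concrete, here is the one-shot version of each approach (a sketch; the expected totals just add the timings measured above):

''.join(random.choices('abc', k=1000))
# list path: ~276 µs to generate + ~8 µs to join

''.join(np.random.choice([char for char in 'abc'], size=1000, replace=True))
# numpy path: ~30 µs to generate + ~257 µs to join

(Of course, given the tolist() timing above, routing the numpy result through .tolist() before joining should make the numpy path the fastest overall.)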

Engineero
  • I doubt this has much to do with strings in particular, and more so just the fact that `''.join` will be doing item-by-item iteration over the NumPy array, which is markedly slower than item-by-item iteration over a Python list. You can benchmark `for ele in my_list: pass` and `for ele in my_array: pass` to see this (sketched after these comments). – miradulo May 18 '18 at 14:27
  • That makes sense, but *why* is iteration over a numpy array so much slower when numpy is otherwise so well optimized? – Engineero May 18 '18 at 14:28
  • Some [thoughts here](https://stackoverflow.com/a/35232645/4686625) on mgilson's answer. – miradulo May 18 '18 at 14:28
  • Your first thought is probably correct: `list(my_array)` gives a rather slow time, unlike `my_array.tolist()`, which the join method won't use. – user2699 May 18 '18 at 14:32
  • Thanks, @miradulo, I think that pretty well answers it. I'll probably delete this or mark it as a duplicate of that question, since that seems to be what it comes down to. – Engineero May 18 '18 at 14:33
  • In general numpy is slow when you iterate over elements in a Python loop and very fast when you use numpy operations. ``tolist()`` is a numpy operation and probably optimized, while ``list()`` works on the iterator of the object. There is probably even a numpy-only option to join the values of an array (some ``reduce``-like operation). – allo May 18 '18 at 14:35
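A minimal sketch of the benchmark miradulo suggests; per the comments, the array loop comes out markedly slower than the list loop:

# Iterating a numpy array boxes each raw fixed-width element (dtype
# '<U1' here) into a new numpy.str_ object on every access, whereas
# iterating a list just hands back references to existing str objects.
for ele in my_list:
    pass

for ele in my_array:
    pass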

0 Answers