7

Consider this dictionary of pandas series. The index on all series are integers and have some potential overlap, but certainly do not coincide. I made an observation that pd.concat seems slow when combining things along axis=1 when I have large indices, lots of non-overlap, and many items to concatenate. It prompted me to leave axis=0 and subsequently unstack(). I end up with the same exact result. But unstacking is quicker.

Does anyone have a handle on why this would be the case?

I get that concatenating series on top of one another should be quick, but I would have guessed that the unstacking processes would have been near identical as pd.concat(axis=1).

dict_of_series = {
    's%s' % i: pd.Series(
        1, np.unique(np.random.randint(1000, 10000, size=1000))
    ) for i in range(100)
}

%%timeit
pd.concat(dict_of_series, axis=0).unstack(0)

10 loops, best of 3: 29.6 ms per loop

%%timeit
pd.concat(dict_of_series, axis=1)

10 loops, best of 3: 43.1 ms per loop

piRSquared
  • 285,575
  • 57
  • 475
  • 624

1 Answers1

3

A quick profiling shows that a lot of time in the call pd.concat(dict_of_series, axis=1) is consumed by pd.Index.union. It is called on each index one by one inside an internal function _union_indexes - similarly as in my function union below.

On the other hand, pd.concat(dict_of_series, axis=0).unstack(0) makes use of some clever optimization, which I didn't try to analyze. Anyway, there is no call to _union_indexes or even pd.Index.union there.

Can '_union_indexes' be a culprit? Very likely. I took just the indexes:

idxs = [d.index for d in dict_of_series.values()]

and compared the following two functions:

def union(idxs):
    ans = pd.Index([])
    for idx in idxs:
        ans = ans.union(idx)
    return ans

def union_multi(idxs):
    arr0 = np.concatenate(idxs)
    arr1 = np.zeros(arr0.size)
    ans = pd.MultiIndex.from_arrays([arr0, arr1])
    return ans.levels[0]

union is an equivalent of _union_indexes, while union_multi has some multi-index overhead but avoids a call to pd.Index.union.

My rough timings:

>>> %timeit -n1 -r1 pd.concat(dict_of_series, axis=1)
1 loop, best of 1: 82.9 ms per loop
>>> %timeit -n1 -r1 pd.concat(dict_of_series, axis=0).unstack(0)
1 loop, best of 1: 57.9 ms per loop

>>> %timeit -n1 -r1 union(idxs)
1 loop, best of 1: 32.8 ms per loop
>>> %timeit -n1 -r1 union_multi(idxs)
1 loop, best of 1: 12.5 ms per loop

A simple pd.Index(np.unique(np.concatenate(idxs))) is slightly faster but it doesn't work for mixed types.

(BTW, pd.concat works only if all indexes are unique.)

ptrj
  • 5,152
  • 18
  • 31