
I have identified one pandas command

timeseries.loc[z, x] = y

to be responsible for most of the time spent in each iteration, and I am now looking for better approaches to accelerate it. The loop covers not even 50k elements (the production goal is ~250k or more), but already needs a sad 20 seconds.

Here is my code (ignore the top half; it is just the timing helper):

import time   # the timing helper below relies on time.time()

def populateTimeseriesTable(df, observable, timeseries):
    """
    Go through all rows of df and 
    put the observable into the timeseries 
    at correct row (symbol), column (tsMean).
    """

    print "len(df.index)=", len(df.index)  # show number of rows

    global bf, t
    bf = time.time()                       # set 'before' to now
    t = dict([(i,0) for i in range(5)])    # fill category timing with zeros

    def T(i):
        """
        timing helper: Add passed time to category 'i'. Then set 'before' to now.
        """
        global bf, t 
        t[i] = t[i] + (time.time()-bf)
        bf = time.time()        

    for i in df.index:             # this is the slow loop
        bf = time.time()

        sym = df["symbol"][i]
        T(0)

        tsMean = df["tsMean"][i]
        T(1)

        tsMean = tsFormatter(tsMean)
        T(2)

        o = df[observable][i]
        T(3)

        timeseries.loc[sym, tsMean] = o
        T(4)

    from pprint import pprint
    print "times needed (total = %.1f seconds) for each command:" % sum(t.values())
    pprint (t)

    return timeseries

Together with (not important, not slow):

def tsFormatter(ts):
    "as human readable string, only up to whole seconds"
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))


--> The to-be-optimized code is in the for-loop.

(T and t are just a helper function and dict, for the timing.)

I have timed every step. The vast majority of time:

len(df.index)= 47160
times needed (total = 20.2 seconds) for each command:
{0: 1.102,
 1: 0.741,
 2: 0.243,
 3: 0.792,
 4: 17.371}

is spent in the last step

timeseries.loc[sym, tsMean] = o

I have already downloaded and installed PyPy - but sadly, that doesn't support pandas yet.

Any ideas how to speed up populating a 2D array?

Thanks!


Edit: Sorry, hadn't mentioned - 'timeseries' is a dataframe too:

timeseries = pd.DataFrame({"name": titles}, index=index)
  • I don't know what kind of object `timeseries` is. But, if it has a `.loc` method, it may have a `.at` method. If you are assigning at a specific location, `.at` should be quicker. – piRSquared Jun 10 '16 at 23:24
  • Edit: Sorry, hadn't mentioned: timeseries is a dataframe: timeseries = pd.DataFrame({"name": titles}, index=index) I have added that to the OP now. – AltSheets Jun 11 '16 at 01:19
  • I will be looking into that .at function. Thanks a lot, @piRSquared – AltSheets Jun 11 '16 at 01:20
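For reference, here is piRSquared's `.at` suggestion applied to the slow loop (a minimal sketch; it assumes every (sym, tsMean) label pair already exists in timeseries, since unlike .loc, .at may refuse to enlarge the frame on older pandas versions):

for i in df.index:
    sym    = df["symbol"][i]
    tsMean = tsFormatter(df["tsMean"][i])
    o      = df[observable][i]
    timeseries.at[sym, tsMean] = o   # scalar accessor, much cheaper than .loc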

3 Answers


UPDATE: starting with Pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
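The replacement is mechanical: label-based .ix lookups become .loc, position-based ones become .iloc, e.g.:

df.ix[2, 'b']   # deprecated, label-based     ->  df.loc[2, 'b']
df.ix[2, 1]     # deprecated, position-based  ->  df.iloc[2, 1]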

=====================================================================

@jezrael has provided an interesting comparison, and I decided to repeat it using more indexing methods and a 10M-row DF (actually the size doesn't matter in this particular case):

Setup:

In [15]: df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('abcde'))

In [16]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 5 columns):
a    float64
b    float64
c    float64
d    float64
e    float64
dtypes: float64(5)
memory usage: 381.5 MB

In [17]: df.shape
Out[17]: (10000000, 5)

Timing:

In [37]: %timeit df.loc[random.randint(0, 10**7), 'b']
1000 loops, best of 3: 502 µs per loop

In [38]: %timeit df.iloc[random.randint(0, 10**7), 1]
1000 loops, best of 3: 394 µs per loop

In [39]: %timeit df.at[random.randint(0, 10**7), 'b']
10000 loops, best of 3: 66.8 µs per loop

In [41]: %timeit df.iat[random.randint(0, 10**7), 1]
10000 loops, best of 3: 32.9 µs per loop

In [42]: %timeit df.ix[random.randint(0, 10**7), 'b']
10000 loops, best of 3: 64.8 µs per loop

In [43]: %timeit df.ix[random.randint(0, 10**7), 1]
1000 loops, best of 3: 503 µs per loop
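iat is the fastest of these, but it needs integer positions. If you only have labels, they can be mapped to positions once, outside any hot loop (a small sketch using pandas' Index.get_loc):

row_pos = df.index.get_loc(12345)     # label -> integer row position
col_pos = df.columns.get_loc('b')     # label -> integer column position
value   = df.iat[row_pos, col_pos]    # fast positional scalar access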

Results as a bar plot:

(bar plot of the timings per indexing method; the values match the table below)

Timing data as DF:

In [88]: r
Out[88]:
       method  timing
0         loc   502.0
1        iloc   394.0
2          at    66.8
3         iat    32.9
4    ix_label    64.8
5  ix_integer   503.0

In [89]: r.to_dict()
Out[89]:
{'method': {0: 'loc',
  1: 'iloc',
  2: 'at',
  3: 'iat',
  4: 'ix_label',
  5: 'ix_integer'},
 'timing': {0: 502.0,
  1: 394.0,
  2: 66.799999999999997,
  3: 32.899999999999999,
  4: 64.799999999999997,
  5: 503.0}}

Plotting:

import seaborn as sns   # the plot below uses seaborn

ax = sns.barplot(data=r, x='method', y='timing')
ax.tick_params(labelsize=16)
# annotate each bar with its height (the timing in microseconds)
[ax.annotate(str(round(p.get_height(), 2)), (p.get_x() + 0.2, p.get_height() + 5)) for p in ax.patches]
ax.set_xlabel('indexing method', size=20)
ax.set_ylabel('timing (microseconds)', size=20)
– MaxU
    wow wow wow. Thank you so much. This is truly enlightening. So my scepticism about such slowness was justified indeed. I had chosen perhaps the most comfortable solution, using .loc - but I bought it with a 7.5 to 15.3 fold punishment in time complexity, compared to .at and .iat - thanks a lot, this is very valuable information! – AltSheets Jun 11 '16 at 12:42
  • Done. Thx! Much appreciated. – AltSheets Jun 11 '16 at 13:16
  • Unfortunately, I cannot accept BOTH answers. Hmmmm ... yours has this really easy to read histogram output. But @jezrael was first to answer, and his answer is very good too. What to do now? – AltSheets Jun 11 '16 at 13:18

I always thought at was the fastest, but it is not. ix is faster:

import pandas as pd

df = pd.DataFrame({'A':[1,2,3],
                   'B':[4,5,6],
                   'C':[7,8,9],
                   'D':[1,3,5],
                   'E':[5,3,6],
                   'F':[7,4,3]})

print (df)
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

print (df.at[2, 'B'])
6
print (df.ix[2, 'B'])
6
print (df.loc[2, 'B'])
6

In [77]: %timeit df.at[2, 'B']
10000 loops, best of 3: 44.6 µs per loop

In [78]: %timeit df.ix[2, 'B']
10000 loops, best of 3: 40.7 µs per loop

In [79]: %timeit df.loc[2, 'B']
1000 loops, best of 3: 681 µs per loop

EDIT:

I tried MaxU's df, and the differences are caused by the random.randint function:

df = pd.DataFrame(np.random.rand(10**7, 5), columns=list('ABCDE'))


In [4]: %timeit (df.ix[2, 'B'])
The slowest run took 25.80 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 20.7 µs per loop

In [5]: %timeit (df.ix[random.randint(0, 10**7), 'B'])
The slowest run took 9.42 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 28 µs per loop
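To isolate how much of that gap is the random.randint call itself, one could time the generator alone and subtract it from the lookup timings (a quick sanity check; output not shown):

import random

%timeit random.randint(0, 10**7)   # cost of generating the index, without any lookup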
– jezrael
  • Very interesting. Thank you very much! Now I am wondering how to explain away the differences between your results, jezrael, and those which @MaxU has reported there: http://stackoverflow.com/a/37761363/6185262 – AltSheets Jun 11 '16 at 12:45
  • is it perhaps his df is 10M rows, and yours is small? So perhaps .ix is the fastest for small dataframes, and .at / .iat the fastest for huge ones? – AltSheets Jun 11 '16 at 12:46
  • I think timings are best for comparing methods, and on each PC they are a little different. And these timings do not depend on the length of the dataframe. `iat` is the fastest, but you need the position of the column - if you don't know the position and only the column name, you cannot use iat - using ix is the fastest approach. – jezrael Jun 11 '16 at 12:53
  • It is best explained in the [docs](http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing). – jezrael Jun 11 '16 at 12:54
  • And if you use [`anaconda`](https://docs.continuum.io/anaconda/install), you can check the timings on your PC too. – jezrael Jun 11 '16 at 12:58
  • One idea - the difference between my results and MaxU's is that I use the position as an int (2) while MaxU generates a random int - so of course generating the number takes some time and the select takes another; together the time is a bit higher than mine (I don't generate a random position). – jezrael Jun 11 '16 at 13:01
  • I edited the solution and added timings with and without generating the `int`. – jezrael Jun 11 '16 at 13:13
  • ahhh, yes. That can explain, indeed. Thank you very much, jezrael. I hope I can accept both of your solutions as solutions to this question. – AltSheets Jun 11 '16 at 13:15
  • Oh no, I cannot. What a pity. – AltSheets Jun 11 '16 at 13:17
  • Yes, only one can be accepted. And it is up to you. Nice day. – jezrael Jun 11 '16 at 13:19
  • As both your answers are equally good ... I have thrown a thousand coins: sum([random.random() for _ in range(1000)])/1000 and that resulted in 0.4972785257694664 - so MaxU has won, by random choice :-) Nice day to you too! – AltSheets Jun 11 '16 at 13:24
  • Ok. Maybe next time I will be winner. ;-) – jezrael Jun 11 '16 at 13:27
  • @AltSheets, i like your approach! :) – MaxU - stand with Ukraine Jun 11 '16 at 13:36

If you are adding rows inside a loop, consider these performance issues: for roughly the first 1000 to 2000 records, "my_df.loc" performance is better, and it gradually becomes slower as the number of records in the loop grows.

If you plan to do things inside a big loop (say 10M records or so), you are better off using a mixture of "iloc" and "append": fill a temp dataframe with iloc until the size gets to around 1000, then append it to the original dataframe and empty the temp dataframe. This can boost your performance around 10 times (see the sketch below).
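A minimal sketch of that batching idea (records, CHUNK and the column names are illustrative, not from the answer; pd.concat is used for the flush because DataFrame.append was removed in pandas 2.0):

import pandas as pd

records = [(i, 2 * i) for i in range(10500)]   # example data: 10.5 buffers' worth
CHUNK = 1000                                   # flush threshold, as suggested above
big_df = pd.DataFrame(columns=['a', 'b'])      # the frame being populated
tmp = pd.DataFrame(index=range(CHUNK), columns=['a', 'b'])  # reusable buffer
n = 0

for rec in records:
    tmp.iloc[n] = rec                          # cheap positional write into the buffer
    n += 1
    if n == CHUNK:                             # buffer full: flush it in one go
        big_df = pd.concat([big_df, tmp], ignore_index=True)
        n = 0

if n:                                          # flush whatever is left in the buffer
    big_df = pd.concat([big_df, tmp.iloc[:n]], ignore_index=True)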

– Afshin Amiri