
I have this code:

from datetime import date, timedelta
from time import time
import pandas as pd

sizes = [500]

base_date = date(2016,10,31)

for n in sizes:
    dates = [base_date - timedelta(days = x) for x in range(1, n, 1)]
    dates_df = pd.DataFrame({'DATE' : dates, 'key' : 1})
    identifiers = range(1, 5000)
    identifiers_df = pd.DataFrame({'IDENTIFIER' : identifiers, 'key' : 1})

    df = pd.merge(dates_df, identifiers_df, on='key')
    df = df.set_index(['DATE', 'IDENTIFIER'])
    df = df.sort_index(axis = 0, level = ['DATE', 'IDENTIFIER'], ascending=False)

    start_time = time()
    for d in dates:
        temp = df.ix[d]

    end_time = time()

    print ('%s %s' % (n, end_time - start_time))

With pandas 0.12 the final print reports 0.15 seconds, whereas with pandas 0.18 the same loop takes 8.5 seconds. Any idea why the behavior differs? It also looks like pandas 0.12 uses random access while 0.18 does not, because in 0.18 the printed time also depends on the size selected.

As suggested in a comment below, I profiled the code above with cProfile. The major difference between the two versions seems to be in the call to __getitem__:

Pandas 0.18
ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
998/499  0.006    0        6.027    0.012    indexing.py:1286(__getitem__)

Pandas 0.12
ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
499      0.001    0        0.163    0        indexing.py:695(__getitem__)
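
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch of collecting a cProfile report sorted by cumulative time. The helper names `profile_call` and `lookup_loop` are hypothetical, and the dictionary lookup is only a stand-in for the `df.ix[d]` loop; swap in the real loop under each pandas version to compare the `__getitem__` rows.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so expensive call chains appear first
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def lookup_loop(table, keys):
    # Stand-in workload (hypothetical): one lookup per key
    return [table[k] for k in keys]


table = {k: k * k for k in range(1000)}
result, report = profile_call(lookup_loop, table, range(1000))
print(report)
```

The report has the same ncalls / tottime / cumtime columns as the excerpts above, so the two versions can be compared line by line.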

Thank you so much in advance for all your help! Giuliano

Giuliano
  • Have you tried profiling? http://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script – C8H10N4O2 Oct 31 '16 at 14:21

1 Answer


In later versions of pandas, iloc is preferred over ix and is better optimized; ix will be deprecated soon. Try porting your code to iloc and check the performance.
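
A rough sketch of what such a port could look like, on a much smaller frame than the one in the question. It assumes the frame is sorted so that every date occupies one contiguous block of `len(identifiers)` rows; the names `block` and `blocks` are made up for illustration. Note that iloc is purely positional, so the slices keep the DATE level in the index, and this is not a measured fix for the slowdown.

```python
from datetime import date, timedelta

import pandas as pd

base_date = date(2016, 10, 31)
dates = [base_date - timedelta(days=x) for x in range(1, 4)]
identifiers = list(range(1, 6))

# Build the (DATE, IDENTIFIER) index directly instead of merging on a dummy key
index = pd.MultiIndex.from_product(
    [dates, identifiers], names=["DATE", "IDENTIFIER"]
)
df = pd.DataFrame({"key": [1] * len(index)}, index=index)
df = df.sort_index(ascending=False)

# Assumption: after the descending sort, dates run newest first and each
# date owns one contiguous block of len(identifiers) rows.
block = len(identifiers)
sorted_dates = sorted(dates, reverse=True)
blocks = {
    d: df.iloc[i * block:(i + 1) * block]
    for i, d in enumerate(sorted_dates)
}
```

Whether this beats a label-based lookup depends on the pandas version, so it is worth timing both in the same way as the loop in the question.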

Zeugma
  • ix performs label indexing (and can fall back on positional), while loc is only label indexing. iloc however is purely positional. It seems that you are implying that multi-indexing, or indexing by dates (typical for time series analysis) should disappear in future versions? If that is the case, can you please send a link to confirm that statement? – Giuliano Nov 01 '16 at 10:45
  • Thank you for the link! I changed the code to use loc, instead of ix, and it is much slower in 0.18 than 0.12, 15 secs vs 1.9 secs. – Giuliano Nov 01 '16 at 11:07
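
The label versus position distinction raised in the comments can be seen on a tiny example. This is only an illustration with a made-up Series whose integer labels do not match the row positions; it says nothing about the timing difference between 0.12 and 0.18.

```python
import pandas as pd

# Labels 10, 20, 30 deliberately differ from positions 0, 1, 2
s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

by_label = s.loc[20]      # loc: look up the row labelled 20
by_position = s.iloc[1]   # iloc: look up the second row, whatever its label

# Position 1 is not a label here, so a loc lookup for 1 would raise KeyError
label_one_exists = 1 in s.index
```

Both lookups return the same row here, but only because label 20 happens to sit at position 1.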