I have this code:
from datetime import date, timedelta
from time import time
import pandas as pd
sizes = [500]
base_date = date(2016,10,31)
for n in sizes:
    dates = [base_date - timedelta(days=x) for x in range(1, n, 1)]
    dates_df = pd.DataFrame({'DATE': dates, 'key': 1})
    identifiers = range(1, 5000)
    identifiers_df = pd.DataFrame({'IDENTIFIER': identifiers, 'key': 1})
    # cross join of dates and identifiers via the constant 'key' column
    df = pd.merge(dates_df, identifiers_df, on='key')
    df = df.set_index(['DATE', 'IDENTIFIER'])
    df = df.sort_index(axis=0, level=['DATE', 'IDENTIFIER'], ascending=False)
    # time one .ix lookup per date
    start_time = time()
    for d in dates:
        temp = df.ix[d]
    end_time = time()
    print('%s %s' % (n, end_time - start_time))
The final print from pandas 0.12 is about 0.15 seconds, but with pandas 0.18 the same code takes about 8.5 seconds. Any idea why the behavior differs? It also looks like pandas 0.12 does a random-access lookup while 0.18 does not, because under 0.18 the printed time also grows with the size n.
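To test the random-access idea, here is a minimal sketch (not part of the benchmark above) that times a single .ix lookup in isolation; it assumes df and dates are still in scope from the last iteration of the loop:

from timeit import timeit

# average time for one .ix lookup by date; if 0.18 scans rather than using
# random access, this number should grow with the frame size
single = timeit(lambda: df.ix[dates[0]], number=100) / 100
print('single lookup: %.6f s' % single)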
As suggested in a comment below, I profiled the code above with cProfile, and the major difference between the two versions seems to be in the call to __getitem__:
Pandas 0.18

  ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
 998/499    0.006        0    6.027    0.012  indexing.py:1286(__getitem__)

Pandas 0.12

  ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     499    0.001        0    0.163        0  indexing.py:695(__getitem__)
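For reference, the profiles above were gathered roughly like this (a minimal sketch; run_lookups is just a hypothetical wrapper name for the inner loop):

import cProfile

def run_lookups():
    # the same per-date lookups as in the timed loop above
    for d in dates:
        temp = df.ix[d]

# sort by cumulative time so __getitem__ shows up near the top
cProfile.run('run_lookups()', sort='cumulative')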
Thank you so much in advance for all your help! Giuliano