SO Question which triggered my question: Pandas selecting by label sometimes return series, sometimes returns dataframe
I have a dataframe with users, products and clicks. In order to quickly look up which products are clicked by a user I make 'users' the index of the dataframe. I am selecting rows from this df by df.loc[u], but
- If there is only one row => I get a Series
- If there are multiple rows => I get a Dataframe
I always want to have a dataframe. The linked SO question has an accepted answer suggesting to make the lookup as follow: df.loc[[u]], this turns out to be a very bad idea though, as it is much less efficient.
In the following fragment I compared the runtimes for different access approaches:
import timeit
import numpy as np
from functools import partial
n=1000
s=1000000
df = pd.DataFrame(np.random.randint(0,100,size=(s, 4)), columns=list('ABCD'))
#modify index to have some duplicates
l = list(df.index)
M = s//2
for i in range(10):
l[M+i] = l[M]
df.index= l
def func1(df):
return df.loc[20]
def func2(df):
return df.loc[20:20]
def func3(df):
return df.loc[df.index == 20]
def func4(df):
return df.loc[[20]]
print(timeit.timeit( partial(func1, df=df ), number=n )/n)
print(timeit.timeit( partial(func2, df=df ), number=n )/n)
print(timeit.timeit( partial(func3, df=df ), number=n )/n)
print(timeit.timeit( partial(func4, df=df ), number=n )/n)
Gives very different runtimes:
0.0001941031701862812
0.0001424320638179779
0.0011041645780205727
0.10913500126451253
Any insights on why the performance difference is so big? Should I always use the second approach when working with a duplicate index? Or are there cases when another approach is preferred? Avoiding duplicate indexes altogether would probably be the ideal solution but the filter approach (func3) is also quite slow!