2

SO Question which triggered my question: Pandas selecting by label sometimes return series, sometimes returns dataframe

I have a dataframe with users, products and clicks. In order to quickly look up which products are clicked by a user I make 'users' the index of the dataframe. I am selecting rows from this df by df.loc[u], but

  • If there is only one row => I get a Series
  • If there are multiple rows => I get a Dataframe

I always want to have a dataframe. The linked SO question has an accepted answer suggesting to make the lookup as follow: df.loc[[u]], this turns out to be a very bad idea though, as it is much less efficient.

In the following fragment I compared the runtimes for different access approaches:

import timeit
import numpy as np    
from functools import partial
n=1000
s=1000000
df = pd.DataFrame(np.random.randint(0,100,size=(s, 4)), columns=list('ABCD'))

#modify index to have some duplicates 
l = list(df.index) 
M = s//2
for i in range(10):
    l[M+i] = l[M]
df.index= l

def func1(df):
  return df.loc[20]

def func2(df):
   return df.loc[20:20]

def func3(df):
   return df.loc[df.index == 20]

def func4(df):
   return df.loc[[20]]

print(timeit.timeit( partial(func1, df=df ), number=n )/n)
print(timeit.timeit( partial(func2, df=df ), number=n )/n)
print(timeit.timeit( partial(func3, df=df ), number=n )/n)
print(timeit.timeit( partial(func4, df=df ), number=n )/n)

Gives very different runtimes:

0.0001941031701862812
0.0001424320638179779
0.0011041645780205727
0.10913500126451253

Any insights on why the performance difference is so big? Should I always use the second approach when working with a duplicate index? Or are there cases when another approach is preferred? Avoiding duplicate indexes altogether would probably be the ideal solution but the filter approach (func3) is also quite slow!

DDW
  • 1,975
  • 2
  • 13
  • 26

0 Answers0