Pandas: What is the fastest way to search a large dataframe

Question

pandas newbie question:

I have a dataframe with millions of rows. A sample looks like this:

c_id  c1   c2
0     10  100
0     15  110
0     15  112
2     96  120
56    43  42

For each customer id (c_id), I want to create a table and do some stuff to it. What's the best way to do that? I sorted the dataframe by c_id, then set it as the index:

df = df.sort_values('c_id', ascending=False)
df = df.set_index('c_id')

but a simple operation like:

temp_df = df.loc[:0]

takes forever. What's the fastest way to approach this problem? I thought that sorting and then calling set_index would do the trick. I guess not.

EDIT1:

I want to get the list of all the unique values of c1 for each value of c_id, so something like:

df.loc[:0].c1.unique()

There might be quite a few different approaches, depending on the "stuff you want to do with subsets of your DF". Try to explain what you are trying to achieve and post your desired data set... — MaxU - stand with Ukraine, May 26 '17 at 15:08
It's non-performant to have a non-unique index; you'd be better off just `group`ing on `c_id`. You can then do `gp.get_group(your_c_id)` to return a specific group, but you'd need to do some aggregation on the `groupby` object in order to return a series/df — EdChum, May 26 '17 at 15:09
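
A minimal sketch of the get_group approach mentioned in the comment above, assuming the sample data from the question (the variable names `gp`, `group_0` and `unique_c1` are only illustrative):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "c_id": [0, 0, 0, 2, 56],
    "c1": [10, 15, 15, 96, 43],
    "c2": [100, 110, 112, 120, 42],
})

# Group once on the column instead of using c_id as a non-unique index
gp = df.groupby("c_id")

# Pull out the rows belonging to a single c_id
group_0 = gp.get_group(0)

# Unique c1 values within that one group, as a plain Python list
unique_c1 = group_0["c1"].unique().tolist()
```

This only helps when a single group is needed; for a result covering every c_id, an aggregation on the groupby object is usually the better fit.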

FLab · Accepted Answer · 2017-05-26T16:57:10.843

Don't create the groups explicitly; use pandas groupby instead.

For example, say that you want to find the average value for each client. You can do:

df.groupby('c_id').mean()

and so on.

You can also apply (almost) arbitrary transformations using the .apply and .transform methods, although built-in methods like mean, std, min and max are much more efficient, as they are optimised.
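
As a rough sketch of the difference, assuming the sample data from the question (the names `mean_builtin`, `mean_apply` and `c1_centered` are made up for illustration), the same per-group mean can be computed with the built-in method or with .apply, and .transform returns a result aligned with the original rows:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "c_id": [0, 0, 0, 2, 56],
    "c1": [10, 15, 15, 96, 43],
    "c2": [100, 110, 112, 120, 42],
})

grouped = df.groupby("c_id")

# Built-in aggregation: one row per c_id, runs in optimised code
mean_builtin = grouped["c1"].mean()

# Same result via .apply with a Python function: more flexible, usually slower
mean_apply = grouped["c1"].apply(lambda s: s.mean())

# .transform keeps the original shape: each row gets its group's mean,
# so it can be combined directly with the original column
c1_centered = df["c1"] - grouped["c1"].transform("mean")
```

On a frame with millions of rows, the built-in version is the one to try first; .apply is mainly useful when no built-in method does what you need.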

To answer your specific question, you can do:

df.groupby('c_id').c1.nunique()

which gives:

c_id
0     2
2     1
56    1
Name: c1, dtype: int64

Note that some other questions suggest that .nunique is not the fastest way to go, and propose this alternative instead:

df.groupby('c_id').c1.apply(lambda x: len(x.unique()))

(I haven't done any benchmarking myself...)
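
If you want to check this on your own data, a small sketch along these lines might help. It assumes the sample data from the question, uses the standard-library timeit module, and the variable names are illustrative. The timings depend heavily on the pandas version and the data, so no result is claimed here:

```python
import timeit

import pandas as pd

# Sample data copied from the question; swap in the real frame for a meaningful timing
df = pd.DataFrame({
    "c_id": [0, 0, 0, 2, 56],
    "c1": [10, 15, 15, 96, 43],
    "c2": [100, 110, 112, 120, 42],
})

# Both approaches count the distinct c1 values per c_id
by_nunique = df.groupby("c_id")["c1"].nunique()
by_apply = df.groupby("c_id")["c1"].apply(lambda x: len(x.unique()))

# Confirm they agree before comparing speed
same = by_nunique.tolist() == by_apply.tolist()

# Time each approach over a few repetitions (seconds, lower is faster)
t_nunique = timeit.timeit(lambda: df.groupby("c_id")["c1"].nunique(), number=20)
t_apply = timeit.timeit(
    lambda: df.groupby("c_id")["c1"].apply(lambda x: len(x.unique())), number=20
)
print(f"nunique: {t_nunique:.4f}s, apply: {t_apply:.4f}s")
```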

Pretty much the right answer. What I was looking for was: df.groupby('c_id').c1.unique(). I did not know pandas could hold an array. — user1871528, May 27 '17 at 13:18
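
For reference, the unique() call from the comment above can be sketched as follows, assuming the sample data from the question (`unique_per_id` and `as_lists` are illustrative names). Each cell of the resulting Series holds a NumPy array, which can be converted to plain lists if that is easier to work with:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "c_id": [0, 0, 0, 2, 56],
    "c1": [10, 15, 15, 96, 43],
    "c2": [100, 110, 112, 120, 42],
})

# One entry per c_id; each value is an array of that group's unique c1 values
unique_per_id = df.groupby("c_id")["c1"].unique()

# Optional: convert to a dict of plain Python lists keyed by c_id
as_lists = {c_id: values.tolist() for c_id, values in unique_per_id.items()}
```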
