How to sample efficiently from a large Pandas Dataframe?

Question

I have a dataframe called X_it with shape (2667913, 42).

I'm trying to sample from that dataframe using the code below:

import numpy as np

np.random.seed(42)
sel_idx = X_it.sample(frac=0.1).index
X = X_it.loc[sel_idx]

The final line of code hangs indefinitely. Is there any better way of doing it?

What are you trying to do? `X_it.sample(frac=0.1)` will return a sample of 10% of the data frame, which in this case will be 266791 rows. If you want those rows, why not just take that result rather than getting the indices and indexing them from the data frame? — jared, Jul 28 '23 at 04:25
@jared How do I implement your suggestion? I believe you are asking me to do X = X_it.sample(frac=0.1) Got it... It is just that I have another dataframe Y from which I need to extract rows at the same index. Concatenating could be the answer.. — procrastinationmonkey, Jul 28 '23 at 05:31
@user1956069 That's an important clarification which you should include in your question. — jared, Jul 28 '23 at 05:55
@procrastinationmonkey that's what I thought. Then your index is the culprit. How to solve your issue depends on how your data is organized in your two datasets. Can you provide a sample of both? One option is to `reset_index(drop=True)`, but this depends on how you plan to align the two datasets. — mozway, Jul 28 '23 at 18:48
@mozway I fixed it using reset_index. Both X and Y are derived from a single dataframe X_Y_it. I did the drop there. Thank you very much, you helped deliver something at work :) — procrastinationmonkey, Jul 29 '23 at 00:14
@procrastinationmonkey you're welcome. Already aligned indices is indeed the most favorable case ;) — mozway, Jul 29 '23 at 03:31
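
The fix described in these comments (resetting the index on the combined dataframe before splitting it) could look roughly like the sketch below. The column names `f1`, `f2` and `target`, and the toy data, are assumptions made up for illustration; only `X_Y_it`, `X_it`, `X` and `Y` come from the thread.

```python
import pandas as pd

# Hypothetical combined frame with a non-unique index; column names are made up
X_Y_it = pd.DataFrame(
    {'f1': range(20), 'f2': range(20, 40), 'target': range(40, 60)},
    index=[0, 1] * 10,
)

# Replace the index with a unique 0..n-1 RangeIndex before splitting
X_Y_it = X_Y_it.reset_index(drop=True)
X_it = X_Y_it[['f1', 'f2']]
Y_it = X_Y_it[['target']]

# Sample X once, then reuse its labels to pull the matching rows of Y
X = X_it.sample(frac=0.1, random_state=42)
Y = Y_it.loc[X.index]
```

Because both frames share the same unique index after the reset, the label lookup on `Y_it` returns exactly one row per sampled row.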

score 1 · Accepted Answer · answered Jul 28 '23 at 04:22

It's difficult to know exactly what's going on, but I suspect a combination of incorrect use of `sample` and duplicated indices.

Why would you sample rows, then get the index of the output, then slice the original dataframe again with it?

Let's see what could go wrong.

`sample` already returns a DataFrame, so indexing again is redundant:

import pandas as pd

df = pd.DataFrame({'A': range(10),
                   'B': range(10)})
print(df)

   A  B
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
7  7  7
8  8  8
9  9  9

# now let's sample
out = df.sample(frac=0.3)
print(out)

   A  B
9  9  9
1  1  1
0  0  0


# now let's index again
print(out.loc[out.index])

   A  B
9  9  9
1  1  1
0  0  0

The second step is clearly useless, but not much harm done.

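Now let's assume that you have duplicated indices in the input, for example a frame in which every row carries the index label 0:

df = pd.DataFrame({'A': range(10),
                   'B': range(10)},
                  index=[0]*10)

If we just sample, everything is fine: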

out = df.sample(frac=0.3)
print(out)

   A  B
0  5  5
0  9  9
0  2  2

But if we index with that result, things go wrong: a label-based lookup returns every row carrying the requested label, so each row is selected as many times as its label is duplicated. In this example, the n rows of the sampled intermediate become n**2 rows. That gets very large for big inputs and could be why your code hangs:

print(out.loc[out.index])

   A  B
0  5  5
0  9  9
0  2  2
0  5  5
0  9  9
0  2  2
0  5  5
0  9  9
0  2  2
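
To check whether this applies to your data, a small sketch along these lines may help. It uses a toy frame whose index is deliberately all zeros (an assumption for illustration, not your actual data) and counts duplicated labels before comparing row counts.

```python
import pandas as pd

# Toy frame in which every row shares the label 0 (illustrative assumption)
df = pd.DataFrame({'A': range(10), 'B': range(10)}, index=[0] * 10)

# Number of index labels that repeat an earlier label
n_dup = df.index.duplicated().sum()

# Sample 30% of the rows
out = df.sample(frac=0.3, random_state=42)

# Label-based lookup returns every matching row for each requested label
blown_up = out.loc[out.index]

print(n_dup, len(out), len(blown_up))
```

A non-zero `n_dup` on `X_it.index` would suggest that `X_it.loc[sel_idx]` returns far more rows than were sampled.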

I just commented about a similar confusion. It's not clear why they don't just take the return of sample rather than using that result to then index the data frame again. — jared, Jul 28 '23 at 04:27
@jared yes I just saw it. I suspect this is the issue. I requested the output of `df.index.duplicated().sum()` to give us a hint ;) — mozway, Jul 28 '23 at 04:28
Now I understand what you mean.. "Why would you sample rows, then get the index of the output, then slice again the original dataframe with it?" That's because I have another matrix Y, from which I need the same rows.. I think I can just concatenate the 2 matrices and just sample and then split the 2 matrices.. — procrastinationmonkey, Jul 28 '23 at 05:27
@user1956069 that's a good reason, but you will have the same issue with `concat`/`merge` if you have duplicated indices. How do you match the rows? By position? key in a column? You might need to `reset_index`. — mozway, Jul 28 '23 at 05:42

score 0 · Answer 2 · answered Jul 28 '23 at 07:47

Since you said in the comments that you have to apply the same random selection to another dataframe, it would be more efficient to build an index list directly with numpy. This saves you an unnecessary detour through the pandas indices.

import numpy as np

np.random.seed(42)

sample_size = int(0.1 * len(X_it))
sel_idx = np.random.choice(X_it.index, size=sample_size, replace=False)

X = X_it.loc[sel_idx]

You can then use this `sel_idx` directly on your second dataframe.
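
Note that `.loc` still looks rows up by label, so the approach above relies on the index being unique. If that is not guaranteed, one possible variant is to draw row positions instead and use `.iloc` on both dataframes. The sketch below assumes the two frames have the same length and are aligned row by row; `Y_it` is a hypothetical name for the second dataframe, and the toy data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two frames, with duplicated labels on purpose
X_it = pd.DataFrame({'a': range(100)}, index=[0] * 100)
Y_it = pd.DataFrame({'y': range(100)}, index=[0] * 100)

rng = np.random.default_rng(42)
sample_size = int(0.1 * len(X_it))

# Draw row positions (0..n-1) without replacement, not index labels
pos = rng.choice(len(X_it), size=sample_size, replace=False)

# Positional indexing ignores the labels, so duplicates cannot multiply rows
X = X_it.iloc[pos]
Y = Y_it.iloc[pos]
```

Since `.iloc` selects by position, the same `pos` array picks matching rows from both frames regardless of what their indices contain.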
