It's difficult to know exactly what's going on, but I suspect a combination of an incorrect use of `sample` and duplicated indices.
Why would you `sample` rows, then get the index of the output, then slice the original DataFrame again with it?
Let's see what could go wrong.
`sample` already gives you a DataFrame, so indexing again is redundant:
import pandas as pd

df = pd.DataFrame({'A': range(10),
                   'B': range(10)})
print(df)
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
# now let's sample
out = df.sample(frac=0.3)
print(out)
A B
9 9 9
1 1 1
0 0 0
# now let's index again
print(out.loc[out.index])
A B
9 9 9
1 1 1
0 0 0
The second step is clearly useless, but not much harm done.
Now let's assume that you have duplicated indices in the input:
A B
0 0 0
0 1 1
0 2 2
0 3 3
0 4 4
0 5 5
0 6 6
0 7 7
0 8 8
0 9 9
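The code that builds this frame is not shown above. One way to reproduce it, assuming every row simply shares the label 0 (the `index=[0] * 10` argument is my assumption, not necessarily how your data ended up this way), is:

```python
import pandas as pd

# Assumption: all ten rows carry the same index label, 0.
df = pd.DataFrame({'A': range(10), 'B': range(10)},
                  index=[0] * 10)

# A quick way to detect this situation in real data.
print(df.index.is_unique)
```

If `is_unique` is false on your own input, label-based lookups with `.loc` can return more rows than you asked for.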
If we just `sample`, everything is fine:
out = df.sample(frac=0.3)
print(out)
A B
0 5 5
0 9 9
0 2 2
But if we index with that result, things go wrong: each row is selected as many times as its index label is duplicated. In this example, where all rows share one label, n rows in the sampled intermediate become n**2 rows. That grows quickly for large inputs and could be the cause of your timeout:
print(out.loc[out.index])
A B
0 5 5
0 9 9
0 2 2
0 5 5
0 9 9
0 2 2
0 5 5
0 9 9
0 2 2