I have a DataFrame loaded from a .tsv file and want to generate some exploratory plots. The problem is that the data set is large (~1 million rows), so there are too many points on the plot to see a trend, and plotting takes a while.

I want to sub-sample 10000 randomly distributed rows. This should be reproducible, so that the same sequence of random numbers is generated on each run.

This question, *Sample two pandas dataframes the same way*, seems to be on the right track, but with that approach I cannot guarantee the subsample size.

Nishant
  • Does it have to be random? You could e.g. also take every thousandth point. And why can't you guarantee the subsample size? You say you want a subsample of 10000. – joris Sep 10 '13 at 08:26
  • Yeah, I can take every (1/N)th row to subsample down to N points. But I wanted to know how to go about it if we need randomly selected points. The other issue is that if the data is oscillating with a frequency equal to N, I will end up picking the data at the exact same point every time. – Nishant Sep 10 '13 at 08:33
  • OK, good reason. But what is wrong with the solution you linked to? You can set the size of `random.randint` to a certain fraction of the length of your dataframe if you can't guarantee the size. – joris Sep 10 '13 at 08:35
  • If I read the solution correctly, it seems that I cannot control how many records end up in the sub-sample, because I cannot control the number of `True` values generated. I guess there should be a way to generate the sub-sample index using `numpy.random.randint()` without replacement, but I do not know how. – Nishant Sep 10 '13 at 08:38
  • Ah yes, my fault. I should have read it better. See my answer. – joris Sep 10 '13 at 08:55

3 Answers

You can select random elements from the index with `np.random.choice`. E.g. to select 5 random rows:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10))

df.loc[np.random.choice(df.index, 5, replace=False)]

This function is new in NumPy 1.7. If you want a solution that works with an older numpy, you can shuffle the data and take the first elements of that:

df.loc[np.random.permutation(df.index)[:5]]

This way your DataFrame is no longer sorted, but if that is needed for plotting (for example, a line plot), you can simply sort it again afterwards (e.g. with `.sort_index()`).
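Putting this together for the original use case, here is a minimal sketch of a reproducible 10,000-row subsample that is re-sorted for plotting (the column name `value` and the seed are made up for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(42)                      # fixed seed => same rows on every run
df = pd.DataFrame({"value": np.random.rand(1_000_000)})

# Sample 10,000 distinct row labels, then restore the original row order
# so that line plots still make sense.
idx = np.random.choice(df.index, 10_000, replace=False)
sub = df.loc[idx].sort_index()

assert len(sub) == 10_000
assert sub.index.is_monotonic_increasing
```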

joris
  • Thanks. I realized that I was using the wrong brackets: `ix = numpy.random.choice(10, size=5, replace=False, p=None)` and `df = train1.loc(ix)` (should be `train1.loc[ix]`) :) – Nishant Sep 10 '13 at 09:01

Unfortunately `np.random.choice` appears to be quite slow for small samples (less than 10% of all rows); you may be better off using plain ol' `sample` from the standard-library `random` module:

from random import sample
df.loc[sample(df.index, 1000)]

For a large DataFrame (a million rows), small samples show a big difference:

In [11]: %timeit df.loc[sample(df.index, 10)]
1000 loops, best of 3: 1.19 ms per loop

In [12]: %timeit df.loc[np.random.choice(df.index, 10, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [13]: %timeit df.loc[np.random.permutation(df.index)[:10]]
1 loops, best of 3: 1.38 s per loop

In [21]: %timeit df.loc[sample(df.index, 1000)]
10 loops, best of 3: 14.5 ms per loop

In [22]: %timeit df.loc[np.random.choice(df.index, 1000, replace=False)]
1 loops, best of 3: 1.28 s per loop    

In [23]: %timeit df.loc[np.random.permutation(df.index)[:1000]]
1 loops, best of 3: 1.3 s per loop

But at around 10% of the rows, the timings get about the same:

In [31]: %timeit df.loc[sample(df.index, 100000)]
1 loops, best of 3: 1.63 s per loop

In [32]: %timeit df.loc[np.random.choice(df.index, 100000, replace=False)]
1 loops, best of 3: 1.36 s per loop

In [33]: %timeit df.loc[np.random.permutation(df.index)[:100000]]
1 loops, best of 3: 1.4 s per loop

and if you are sampling everything (don't use `sample` for that!):

In [41]: %timeit df.loc[sample(df.index, 1000000)]
1 loops, best of 3: 10 s per loop

Note: both `numpy.random` and `random` accept a seed, so randomly generated output can be reproduced.
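As a minimal sketch of that reproducibility, seeding both generators and re-drawing yields the exact same samples:

```python
import random
import numpy as np

random.seed(0)
np.random.seed(0)
a = random.sample(range(1000), 5)
b = np.random.choice(1000, 5, replace=False)

# Re-seeding with the same values reproduces the exact same draws.
random.seed(0)
np.random.seed(0)
assert a == random.sample(range(1000), 5)
assert (b == np.random.choice(1000, 5, replace=False)).all()
```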

As @joris points out in the comments, `choice` (without replacement) is actually sugar for `permutation`, so it's no surprise it's constant time and slower for smaller samples...
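(Aside, not from the original answers: newer NumPy's `Generator` API is reported to handle small no-replacement samples more efficiently than the legacy `np.random.choice`; a quick sketch, assuming NumPy >= 1.17:)

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator, reproducible draws
idx = rng.choice(1_000_000, size=10, replace=False)

assert len(idx) == 10
assert len(set(idx)) == 10              # sampled without replacement
```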

Andy Hayden
  • Wow, indeed very slow. – joris Sep 10 '13 at 09:46
  • @joris What I find really surprising is it seems choice is no faster than permutation! – Andy Hayden Sep 10 '13 at 09:47
  • But it seems it depends on the ratio you subsample from the total, because the numpy solution only depends on the total size, not on the subsample size. So as the subsample gets larger, both solutions get more on par. If I try it with a subsample of 1/10, they seem as fast as random.sample. – joris Sep 10 '13 at 09:48
  • @joris it feels like it should depend on the ratio which algo to choose... it just seems crazy how bad choice is for small samples. – Andy Hayden Sep 10 '13 at 09:52
  • @AndyHayden Aha, this ``idx = self.permutation(pop_size)[:size]`` in the source code of `choice` clarifies a lot :-) – joris Sep 10 '13 at 10:01
  • Got ```TypeError: Population must be a sequence or set. For dicts, use list(d).```. Fixed with ```list(df.index)```. – hafiz031 Jun 25 '22 at 04:25

These days, one can simply use the sample method on a DataFrame:

>>> help(df.sample)
Help on method sample in module pandas.core.generic:

sample(self, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None) method of pandas.core.frame.DataFrame instance
    Returns a random sample of items from an axis of object.

Replicability can be achieved by using the random_state keyword:

>>> len(set(df.sample(n=1, random_state=np.random.RandomState(0)).iterations.values[0] for _ in range(1000)))
1
>>> len(set(df.sample(n=1).iterations.values[0] for _ in range(1000)))
40
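Putting that together for the original question, a sketch of a reproducible 10,000-row subsample via `DataFrame.sample` (the column name `x` here is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(1_000_000)})

s1 = df.sample(n=10_000, random_state=0)
s2 = df.sample(n=10_000, random_state=0)
assert s1.equals(s2)            # same integer seed => identical sample

sub = s1.sort_index()           # re-sort for line plotting
assert len(sub) == 10_000
```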
Alex Coventry
  • Any idea why the `random_state` parameter is failing to produce replicability? I'm executing multiple times and getting a different ordering every time with `data = data.sample(n = len(data), random_state = np.random.RandomState(1337))` – Greg Hilston Oct 31 '17 at 23:26
  • Found my problem after spending longer than I want to admit on this; commenting in case it helps someone stuck on this in the future. This code will produce the same result: `data.sample(n = len(data), random_state = np.random.RandomState(1337))`, but setting it to a new DataFrame will not. See `frac` to replace the `len` stuff and `replace` to clean it up some more. – Greg Hilston Oct 31 '17 at 23:32