1

I have a dataset with 112k rows and 2 columns. How do I sample equally from this dataset to obtain a smaller dataset of about 10k rows?

I mean equally because this dataset has 56k rows where the column `True` is 1 and 56k rows where `True` is 0.

So I want to sample 10k rows, with 5k where `True=1` and 5k where `True=0`.

Thanks

merchmallow

3 Answers

4

This is called a stratified random sample with equal allocation (i.e. the sample from each group is the same size), which in this case also happens to be proportional allocation (the sample from each group is proportional to the size of the group).

It can be achieved using groupby.sample:

df.groupby("my_column").sample(n=5000)

There are a few earlier questions on this topic but they relate to slightly more complicated cases and seem to have been answered before the groupby.sample method was introduced in pandas v1.1.
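
As a fuller, minimal sketch (the frame and column names below are made up to mirror the question; only groupby.sample and its random_state parameter come from pandas itself):

import numpy as np
import pandas as pd

# Toy stand-in for the 112k-row dataset: a 50/50 split on a "True" column
df = pd.DataFrame({
    "True": np.repeat([1, 0], 56_000),
    "value": np.random.default_rng(0).normal(size=112_000),
})

# Equal allocation: 5,000 rows drawn at random from each group
sample = df.groupby("True").sample(n=5_000, random_state=42)

print(sample["True"].value_counts())  # 5000 rows for each of 0 and 1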

Stuart

2

For splitting the dataset, StratifiedKFold would help:

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

USAGE

>>> import numpy as np
>>> from sklearn.model_selection import StratifiedKFold
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([0, 0, 1, 1])
>>> skf = StratifiedKFold(n_splits=2)
>>> skf.get_n_splits(X, y)
2
>>> print(skf)  
StratifiedKFold(n_splits=2, random_state=None, shuffle=False)
>>> for train_index, test_index in skf.split(X, y):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]
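
If the goal is just to carve out a stratified 10k subset rather than full cross-validation folds, scikit-learn's train_test_split with the stratify argument is a closely related option; a rough sketch under the question's assumptions (balanced classes, label column named `True`):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 112k-row frame with a balanced "True" column, as in the question
df = pd.DataFrame({"True": np.repeat([1, 0], 56_000)})

# train_size=10_000 with stratify preserves the 50/50 balance, so 5k per class
subset, _ = train_test_split(
    df,
    train_size=10_000,
    stratify=df["True"],
    random_state=0,
)

print(subset["True"].value_counts())  # 5000 rows for each of 0 and 1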
zuijiang

0

An alternative solution:

# Take every step_size-th element, starting from index 0
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
step_size = 5
a[0::step_size]

Output: [1, 6, 11, 16]

# Derive the step size from the desired number of samples
sample_count = 5
a[0::int(len(a) / sample_count)]

Output: [1, 5, 9, 13, 17]
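
Note that this is systematic (every k-th row) rather than random sampling, and on the question's data it would need to be applied per class to keep the 5k/5k balance. A rough sketch, again assuming a balanced label column named `True`:

import numpy as np
import pandas as pd

# Hypothetical balanced 112k-row frame, as in the question
df = pd.DataFrame({"True": np.repeat([1, 0], 56_000)})

# Take every k-th row within each class so each contributes 5,000 rows
pieces = []
for _, group in df.groupby("True"):
    step = len(group) // 5_000            # 56_000 // 5_000 = 11
    pieces.append(group.iloc[::step].head(5_000))

sample = pd.concat(pieces)
print(sample["True"].value_counts())      # 5000 rows per class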

leenremm