How to downsample dataframe rows uniformly based on columns' distinct values?

Question

I'm trying to downsample the rows of a dataframe to create a smaller dataframe. Assume the dataframe has several columns, each holding predefined categorical values. How can I make sure that every distinct categorical value has a chance of appearing in the new, resampled dataframe?

For example:
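import pandas as pd

rows = [
    {'A': 'a', 'B': 'd', 'C': 'g'},
    {'A': 'a', 'B': 'e', 'C': 'h'},
    {'A': 'a', 'B': 'd', 'C': 'g'},
    {'A': 'c', 'B': 'f', 'C': 'i'},
    {'A': 'c', 'B': 'd', 'C': 'g'},
    {'A': 'b', 'B': 'e', 'C': 'h'},
]
df = pd.DataFrame(rows)
print(df)

Output of the code:

   A  B  C
0  a  d  g
1  a  e  h
2  a  d  g
3  c  f  i
4  c  d  g
5  b  e  h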

Column 'A' contains the values 'a', 'b' and 'c'. How can I make sure that none of these values are lost after resampling?

Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. — jezrael, Sep 03 '19 at 06:06
Possible duplicate of [Pandas: Sampling a DataFrame](https://stackoverflow.com/questions/12190874/pandas-sampling-a-dataframe) — Ankur Sinha, Sep 03 '19 at 06:07
@jezrael - Thanks for your hints. I edited the question in order to clarify the problem. — Arian Shariat, Sep 03 '19 at 07:47
@qwerty - I came up with grouping each column using `groupby` and taking samples from each group, but I found it too complicated because duplicate rows could occur. – Arian Shariat, Sep 03 '19 at 07:51
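
The `groupby` idea in the last comment can be sketched roughly as follows. This is only an illustration, not a method taken from the thread: `covering_sample`, its parameters `n` and `seed`, and the top-up step are hypothetical, and the sketch assumes the dataframe has a unique index and pandas 1.1 or later (for `DataFrameGroupBy.sample`). It picks one random row per distinct value of each column, drops rows that were picked more than once, and then adds random extra rows if there is room.

```python
import pandas as pd

rows = [
    {'A': 'a', 'B': 'd', 'C': 'g'},
    {'A': 'a', 'B': 'e', 'C': 'h'},
    {'A': 'a', 'B': 'd', 'C': 'g'},
    {'A': 'c', 'B': 'f', 'C': 'i'},
    {'A': 'c', 'B': 'd', 'C': 'g'},
    {'A': 'b', 'B': 'e', 'C': 'h'},
]
df = pd.DataFrame(rows)


def covering_sample(frame, n, seed=None):
    """Hypothetical helper: sample roughly n rows while keeping
    every distinct value of every column (assumes a unique index)."""
    picked = []
    for col in frame.columns:
        # One random row for each distinct value of this column.
        picked.append(frame.groupby(col).sample(n=1, random_state=seed))
    core = pd.concat(picked)
    # The same row may be picked for several columns; keep it once.
    core = core[~core.index.duplicated()]
    # Top up with random rows that were not picked yet, if n allows.
    remaining = frame.drop(index=core.index)
    extra = max(0, min(n - len(core), len(remaining)))
    if extra:
        core = pd.concat([core, remaining.sample(n=extra, random_state=seed)])
    return core.sort_index()


small = covering_sample(df, n=4, seed=0)
print(small)
```

Note that the result can contain more than `n` rows when covering every value requires it; in the example above at least three rows are always needed.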

M_S_N · Answer 1 · 2019-09-03T07:02:51.767

1

You can use:

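import numpy as np
import pandas as pd

data = pd.DataFrame({'col': np.repeat(['A', 'B', 'C'], 12),
                     'value1': np.repeat([1, 0, 1], 12),
                     'value2': np.random.randint(20, 100, 36)})

# Pick one category of 'col' at random and keep only its rows.
data1 = data[data['col'] == np.random.choice(data['col'].unique())].reset_index(drop=True)

# Pick a random start index that leaves room for a 4-row window.
start_ix = np.random.choice(data1.index[:-3])

# .loc slicing is inclusive, so this prints 4 consecutive rows.
print(data1.loc[start_ix:start_ix + 3])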

edited Sep 03 '19 at 07:02

answered Sep 03 '19 at 06:26

M_S_N


I'm not sure, but I guess `df.sample(frac=1)` only shuffles the rows of the dataframe. Does `df.sample()` handle sampling uniformly from the values of all columns? – Arian Shariat Sep 03 '19 at 06:52
This one is the same as `df.sample()` and does not guarantee that every categorical value appears in the new dataframe. – Arian Shariat Sep 03 '19 at 07:44
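
The concern raised in these comments can be checked empirically. The snippet below is a minimal sketch, not part of the original answer: `covers_all_values` is a hypothetical helper, and the dataframe is the one from the question. It tests whether a sample still contains every distinct value of every column, then counts how often a plain 3-row `df.sample()` passes that test.

```python
import pandas as pd

rows = [
    {'A': 'a', 'B': 'd', 'C': 'g'},
    {'A': 'a', 'B': 'e', 'C': 'h'},
    {'A': 'a', 'B': 'd', 'C': 'g'},
    {'A': 'c', 'B': 'f', 'C': 'i'},
    {'A': 'c', 'B': 'd', 'C': 'g'},
    {'A': 'b', 'B': 'e', 'C': 'h'},
]
df = pd.DataFrame(rows)


def covers_all_values(sample, full):
    """Hypothetical helper: True if every distinct value of every
    column in `full` also appears in `sample`."""
    return all(
        set(full[col].unique()) <= set(sample[col].unique())
        for col in full.columns
    )


# frac=1 only shuffles the rows, so nothing can be lost.
shuffled = df.sample(frac=1, random_state=0)
print(covers_all_values(shuffled, df))

# Count how many of 200 plain random 3-row samples keep all values.
hits = sum(
    covers_all_values(df.sample(n=3, random_state=seed), df)
    for seed in range(200)
)
print(f"{hits} of 200 plain samples kept every value")
```

In this dataframe the values 'b', 'f' and 'i' each occur in a single row, so a plain random sample keeps everything only when it happens to include those rows.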
