I'm trying to downsample the rows of a dataframe to create a smaller dataframe. Assume the dataframe has several columns and each column holds predefined categorical values. How can I make sure every distinct categorical value has a chance of appearing in the new, resampled dataframe?

For example:
import pandas as pd

rows = [{'A': 'a', 'B': 'd', 'C': 'g'},
        {'A': 'a', 'B': 'e', 'C': 'h'},
        {'A': 'a', 'B': 'd', 'C': 'g'},
        {'A': 'c', 'B': 'f', 'C': 'i'},
        {'A': 'c', 'B': 'd', 'C': 'g'},
        {'A': 'b', 'B': 'e', 'C': 'h'}]
pd.DataFrame(rows)

[image: output of the code]

In column 'A' we have the values 'a', 'b' and 'c'. How can I make sure that none of these values are lost after resampling?

Arian Shariat
  • Please share what you have tried and where you are stuck. – qwerty Sep 03 '19 at 06:03
  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Sep 03 '19 at 06:06
  • Possible duplicate of [Pandas: Sampling a DataFrame](https://stackoverflow.com/questions/12190874/pandas-sampling-a-dataframe) – Ankur Sinha Sep 03 '19 at 06:07
  • @jezrael - Thanks for your hints. I edited the question in order to clarify the problem. – Arian Shariat Sep 03 '19 at 07:47
  • @qwerty - I came up with grouping the columns using `groupby` and taking samples from each group, but I found it too complicated because duplicate rows could be selected. – Arian Shariat Sep 03 '19 at 07:51

1 Answer

You can use:

import numpy as np
import pandas as pd

data = pd.DataFrame({'col': np.repeat(['A', 'B', 'C'], 12),
                     'value1': np.repeat([1, 0, 1], 12),
                     'value2': np.random.randint(20, 100, 36)})

# Keep only the rows of one randomly chosen category in 'col'
data1 = data[data['col'] == np.random.choice(data['col'].unique())].reset_index(drop=True)

# Pick a random start so that four consecutive rows fit (.loc slicing is inclusive)
start_ix = np.random.choice(data1.index[:-3])

print(data1.loc[start_ix:start_ix+3])

M_S_N
  • I'm not sure but I guess `df.sample(frac=1)` only shuffles rows of dataframe. Does `df.sample()` handle sampling uniformly from values of all columns? – Arian Shariat Sep 03 '19 at 06:52
  • This one is the same as `df.sample()` and does not guarantee that every categorical entry is in the new dataframe. – Arian Shariat Sep 03 '19 at 07:44
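
For the guarantee asked about in the comments, one possible approach is to first pick a random "representative" row for every distinct value in every column, and then top up with ordinary random rows until the target size is reached. The sketch below is only an illustration, not part of the original answer: the function name `sample_keeping_categories` and its parameters are made up here, it assumes the dataframe has a unique index, and it may return more than `n` rows when `n` is too small to cover all values.

```python
import pandas as pd


def sample_keeping_categories(df, n, columns=None, random_state=None):
    """Sample about n rows while keeping every distinct value of `columns`.

    Hypothetical helper; assumes df.index is unique.
    """
    columns = list(df.columns) if columns is None else list(columns)

    # Shuffle once so the representative row for each value is random.
    shuffled = df.sample(frac=1, random_state=random_state)

    # Collect the first shuffled row for each distinct value of each column.
    covering_idx = []
    seen = set()
    for col in columns:
        for idx in shuffled.drop_duplicates(subset=col).index:
            if idx not in seen:
                seen.add(idx)
                covering_idx.append(idx)
    covering = df.loc[covering_idx]

    # Fill the remaining slots with plain random rows not chosen yet.
    rest = df.drop(index=covering_idx)
    n_extra = min(max(n - len(covering), 0), len(rest))
    if n_extra == 0:
        return covering
    extra = rest.sample(n=n_extra, random_state=random_state)
    return pd.concat([covering, extra])


rows = [{'A': 'a', 'B': 'd', 'C': 'g'},
        {'A': 'a', 'B': 'e', 'C': 'h'},
        {'A': 'a', 'B': 'd', 'C': 'g'},
        {'A': 'c', 'B': 'f', 'C': 'i'},
        {'A': 'c', 'B': 'd', 'C': 'g'},
        {'A': 'b', 'B': 'e', 'C': 'h'}]
df = pd.DataFrame(rows)

small = sample_keeping_categories(df, n=4, random_state=0)
print(small)
```

Because each row is selected at most once, this avoids the duplication problem that can come up when sampling from each `groupby` group separately. The representative rows are not chosen to be a minimal set, so the result is not a uniform sample of the original rows; rare values are over-represented by design.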