Taking a proportion of a dataframe based on column values

Question

I have a Pandas dataframe with ~50,000 rows and I want to randomly select a proportion of rows from that dataframe based on a number of conditions. Specifically, I have a column called 'type of use' and, for each field in that column, I want to select a different proportion of rows.

For instance:

df[df['type of use'] == 'housing'].sample(frac=0.2)

This code returns 20% of all the rows which have 'housing' as their 'type of use'. The problem is I do not know how to do this for the remaining fields in a way that is 'idiomatic'. I also do not know how I could take the result from this sampling to form a new dataframe.

you need to iterate over all unique filter options and store in a dictionary of dataframes, check answer below. — anky, Jan 10 '19 at 14:13

anky · Accepted Answer · 2019-01-10T14:23:48.120

1

You can make a unique list for all the values in the column by list(df['type of use'].unique()) and iterate like below:

for i in list(df['type of use'].unique()):
    print(df[df['type of use'] == i].sample(frac=0.2))

or

i = 0 
while i < len(list(df['type of use'].unique())):
    df1 = df[(df['type of use']==list(df['type of use'].unique())[i])].sample(frac=0.2)
    print(df1.head())
    i = i + 1

For storing you can create a dictionary:

dfs = ['df' + str(x) for x in list(df2['type of use'].unique())]
dicdf = dict()
i = 0 
while i < len(dfs):
    dicdf[dfs[i]] = df[(df['type of use']==list(df2['type of use'].unique())[i])].sample(frac=0.2)
    i = i + 1
print(dicdf)

This will print a dictionary of the dataframes. You can print what you like to see for example for housing sample : print (dicdf['dfhousing'])

edited Jan 10 '19 at 14:23

answered Jan 10 '19 at 12:52

anky

74,114
11
41
70

Thanks a lot, @anky_91. This is generally what I was trying to achieve. I am thinking about 'replicating' my code in SQL for something else - do you know if there is an SQL equivalent to df[df['type of use'] == i].sample(frac=0.2), for example? – EarlofMar Jan 10 '19 at 17:12
May be this - https://stackoverflow.com/questions/249301/simple-random-samples-from-a-sql-database?? – anky Jan 10 '19 at 17:53

P E · Answer 2 · 2021-11-09T09:31:28.600

Sorry this is coming in 2+ years late, but I think you can do this without iterating, based on help I received to a similar question here. Applying it to your data:

import pandas as pd
import math
percentage_to_flag = 0.2 #I'm assuming you want the same %age for all 'types of use'?

#First, create a new 'helper' dataframe:
random_state = 41  # Change to get different random values.
df_sample = df.groupby("type of use").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
df_sample = df_sample.reset_index(level=0, drop=True)  #may need this to simplify multi-index dataframe

# Now, mark the random sample in a new column in the original dataframe:
df["marked"] = False
df.loc[df_sample.index, "marked"] = True

Taking a proportion of a dataframe based on column values

2 Answers2

Linked