I have a dataset of 50K user records (keyed by email), and I need to select only 10K of them according to a predefined ratio of values in each category: Region (the Geo column below), Role and Position.
For example, given the following sample data (11 rows), how can I subset it to get 5 rows, split the following way:
- For Geo (region), 80% AMER and 20% INDIA (so EMEA is excluded)
- For Role, 60% Sales, with the rest chosen at random
- For Position, 20% Managers and 80% Operational
Email              Geo    Role       Position
abs@example.com    AMER   Sales      Manager
sdf@example.com    AMER   Sales      Operational
dsfe@example.com   EMEA   Sales      Manager
sdw@example.com    AMER   Sales      Operational
aydje@example.com  EMEA   Sales      Manager
fdsed@example.com  AMER   Testing    Operational
Sfe@example.com    AMER   Testing    Manager
dfgt@example.com   INDIA  Testing    Manager
gsdr@example.com   INDIA  Testing    Operational
dmgru@example.com  AMER   Marketing  Operational
edr@example.com    INDIA  Marketing  Operational
I expect to get something like this:
Email              Geo    Role       Position
abs@example.com    AMER   Sales      Manager
sdf@example.com    AMER   Sales      Operational
sdw@example.com    AMER   Sales      Operational
fdsed@example.com  AMER   Testing    Operational
edr@example.com    INDIA  Marketing  Operational
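
To make the target concrete, here is a small check in plain Python showing that these five rows hit the requested ratios. It is only a sketch: the helper name `shares` and the tuple layout (Geo, Role, Position) are made up for illustration.

```python
from collections import Counter

# The five rows of the expected output, as (Geo, Role, Position).
subset = [
    ("AMER", "Sales", "Manager"),
    ("AMER", "Sales", "Operational"),
    ("AMER", "Sales", "Operational"),
    ("AMER", "Testing", "Operational"),
    ("INDIA", "Marketing", "Operational"),
]

def shares(rows, index):
    """Return {value: fraction of rows} for the column at the given index."""
    counts = Counter(row[index] for row in rows)
    return {value: count / len(rows) for value, count in counts.items()}

geo = shares(subset, 0)
role = shares(subset, 1)
position = shares(subset, 2)

print(geo)       # {'AMER': 0.8, 'INDIA': 0.2}
print(role)      # {'Sales': 0.6, 'Testing': 0.2, 'Marketing': 0.2}
print(position)  # {'Manager': 0.2, 'Operational': 0.8}
```

So 4 of 5 rows are AMER, 3 of 5 are Sales, and 1 of 5 is a Manager.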
I'm aware that there can be more than one valid solution, especially with more data, but any one of them is fine as long as the predefined ratios are respected.
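
For reference, a naive baseline could look like the sketch below. It is not claimed to be the right or the most efficient method. It turns each ratio into a row quota, shuffles the data, and accepts a row only if that does not exceed any quota, retrying with a new shuffle when a pass ends short. The names `quotas`, `sample_by_ratios` and the `OTHER` catch-all label are hypothetical. The sketch also assumes that "the rest at random" means the remaining 40% are non-Sales roles, and that the ratios round to whole row counts.

```python
import random
from collections import Counter

COLUMNS = ("Email", "Geo", "Role", "Position")
OTHER = "OTHER"  # hypothetical catch-all for values without their own quota

def quotas(fractions, n):
    """Turn {value: fraction} into {value: row count} that sums to n."""
    counts = {value: round(frac * n) for value, frac in fractions.items()}
    if sum(counts.values()) != n:
        raise ValueError(f"{fractions} does not round to {n} rows")
    return counts

def sample_by_ratios(rows, n, ratios, max_tries=1000, seed=None):
    """Randomized greedy pick of n rows that fills every quota exactly."""
    rng = random.Random(seed)
    caps = {col: quotas(fracs, n) for col, fracs in ratios.items()}
    for _ in range(max_tries):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        used = {col: Counter() for col in caps}
        picked = []
        for row in shuffled:
            # Map each value to its quota bucket (OTHER if it has none).
            keys = {col: row[col] if row[col] in caps[col] else OTHER
                    for col in caps}
            # Accept the row only if no quota would be exceeded.
            if all(used[col][k] < caps[col].get(k, 0) for col, k in keys.items()):
                for col, k in keys.items():
                    used[col][k] += 1
                picked.append(row)
                # Quotas sum to n per column, so n rows means all are full.
                if len(picked) == n:
                    return picked
    raise RuntimeError("no valid subset found; the ratios may be infeasible")

raw = """abs@example.com AMER Sales Manager
sdf@example.com AMER Sales Operational
dsfe@example.com EMEA Sales Manager
sdw@example.com AMER Sales Operational
aydje@example.com EMEA Sales Manager
fdsed@example.com AMER Testing Operational
Sfe@example.com AMER Testing Manager
dfgt@example.com INDIA Testing Manager
gsdr@example.com INDIA Testing Operational
dmgru@example.com AMER Marketing Operational
edr@example.com INDIA Marketing Operational"""
rows = [dict(zip(COLUMNS, line.split())) for line in raw.splitlines()]

ratios = {
    "Geo": {"AMER": 0.8, "INDIA": 0.2},
    "Role": {"Sales": 0.6, OTHER: 0.4},
    "Position": {"Manager": 0.2, "Operational": 0.8},
}

subset = sample_by_ratios(rows, 5, ratios, seed=0)
for row in subset:
    print(row["Email"], row["Geo"], row["Role"], row["Position"])
```

A single greedy pass can paint itself into a corner, for example by spending the only Manager slot on a non-Sales row, which is why the sketch retries. With 50K rows and tighter ratios this may still fail to find a subset even when one exists, so an exact method such as an integer program may be needed instead.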