I am trying to use sample_n to sample by age group (Bage), gender, and employment in order to fill in a new column with ethnicity. I've found a way to do it, but each sample takes 9 lines of code, and the sample size changes each time because I am distributing a different number of people to each ethnic group.

The example below shows the code for randomly assigning the census-defined ethnic group 'Other' to unemployed males in the 16-24 age group. The example data is taken from the full dataset. I would then repeat all of these lines (changing the specifics: Bage, gender, employment, size) for every employment type and ethnicity, so it is a long, slow process. I've looked at writing loops or functions, but I keep getting stuck because I need a different sample size for each group rather than one sample size across the whole dataset.

Any advice on reducing the length of code and time to do this would be greatly appreciated.

Sample input data, showing males in the 16-24 age group (Bage == 16) for some employment types:

       ID    Ages    Bage  Gender     Employment   Ethnicity
77     16     16     16     Male           PT          
78     78     16     16     Male           PT          
79     79     16     16     Male           PT          
80     80     16     16     Male           PT           
81     81     16     16     Male           PT          
82     82     16     16     Male           PT          
83     83     16     16     Male           PT                  
91     91     16     16     Male           PT          
92     92     16     16     Male           PT          
93     93     16     16     Male           PT          
94     94     16     16     Male           PT     
95     95     16     16     Male           PT     
96     96     16     16     Male           PT     
97     97     16     16     Male           PT     
98     98     16     16     Male           PT     
99     99     16     16     Male           PT     
100   100     16     16     Male           PT     
101   101     16     16     Male           PT     
102   102     16     16     Male           PT        
127   127     16     16     Male           FT     
128   128     16     16     Male           FT     
129   129     16     16     Male           FT     
130   130     16     16     Male           FT     
131   131     16     16     Male           FT     
132   132     16     16     Male           FT     
133   133     16     16     Male           FT     
134   134     16     16     Male           FT     
135   135     16     16     Male           FT     
136   136     16     16     Male         SEFT     
137   137     16     16     Male           UN     
138   138     16     16     Male           UN     
139   139     16     16     Male           UN     
140   140     16     16     Male           UN     
141   141     16     16     Male           UN     
142   142     16     16     Male           UN     
143   143     16     16     Male           UN     
...   ...     ..     ..     ...            ..  

Current code:

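library(dplyr)  # sample_n()
library(tidyr)  # unite()

# Draw one not-yet-assigned (Ethnic == "0") unemployed male aged 16-24 and label him "Other"
UNOTH=sample_n(EdUNAS[EdUNAS$Bage=="16" & EdUNAS$Gender=="Male" & EdUNAS$Employment=="UN" & EdUNAS$Ethnic=="0",],size=1, replace=FALSE)
UNOTH["Ethnic"]="Other"
# Merge the labelled row back into the full data; shared columns get .x/.y suffixes
Edunoth=merge(EdUNAS, UNOTH, by = "ID", all = TRUE)
# Drop the duplicated columns that came from the sampled row
Edunoth$Bage.x.x.y=NULL
Edunoth$Ages.x.x.y=NULL
Edunoth$Gender.x.x.y=NULL
Edunoth$Employment.x.x.y=NULL
# Blank out the NAs, then paste the two Ethnic columns into one
Edunoth[is.na(Edunoth)] = ''
EdUNOTH=unite(Edunoth, Ethnic, Ethnic.x:Ethnic.y, sep='')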

Wanted output: the Ethnicity column filled in based on proportions I know from the census data.

       ID    Ages    Bage  Gender     Employment   Ethnicity
77     16     16     16     Male           PT        White
78     78     16     16     Male           PT        White  
79     79     16     16     Male           PT        White
80     80     16     16     Male           PT        White 
81     81     16     16     Male           PT        White  
82     82     16     16     Male           PT        White  
83     83     16     16     Male           PT        Asian          
91     91     16     16     Male           PT        White  
92     92     16     16     Male           PT        White  
93     93     16     16     Male           PT        Other  
94     94     16     16     Male           PT        White
95     95     16     16     Male           PT        White
96     96     16     16     Male           PT        White
97     97     16     16     Male           PT        White
98     98     16     16     Male           PT        Asian
99     99     16     16     Male           PT        White
100   100     16     16     Male           PT        White
101   101     16     16     Male           PT        White
102   102     16     16     Male           PT        White
127   127     16     16     Male           FT        White
128   128     16     16     Male           FT        White
129   129     16     16     Male           FT        White
130   130     16     16     Male           FT        White
131   131     16     16     Male           FT        White
132   132     16     16     Male           FT        White
133   133     16     16     Male           FT        White
134   134     16     16     Male           FT        White
135   135     16     16     Male           FT        White
136   136     16     16     Male         SEFT        White
137   137     16     16     Male           UN        White
138   138     16     16     Male           UN        White
139   139     16     16     Male           UN        White
140   140     16     16     Male           UN        White
141   141     16     16     Male           UN        Asian
142   142     16     16     Male           UN        White
143   143     16     16     Male           UN        White
...   ...     ..     ..     ...            ..        ...
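
One way the repeated blocks might be collapsed, sketched in base R under assumptions: the census counts are kept in a lookup table with one row per group and ethnicity, and a single function loops over the groups, so the differing sample sizes become data rather than code. The `people` and `targets` data frames, the counts in `n`, and the helper names `group_key` and `assign_ethnicity` are all hypothetical and made up for illustration; they are not the real census figures or part of any package.

```r
# Hypothetical toy data: 10 unemployed and 5 part-time males aged 16-24.
people <- data.frame(
  ID         = 1:15,
  Bage       = 16,
  Gender     = "Male",
  Employment = rep(c("UN", "PT"), times = c(10, 5)),
  Ethnicity  = "",
  stringsAsFactors = FALSE
)

# Hypothetical census counts: one row per group/ethnicity pair,
# with n = number of people in that group to label with that ethnicity.
targets <- data.frame(
  Bage       = 16,
  Gender     = "Male",
  Employment = c("UN", "UN", "UN", "PT", "PT"),
  Ethnicity  = c("White", "Asian", "Other", "White", "Asian"),
  n          = c(8, 1, 1, 4, 1),
  stringsAsFactors = FALSE
)

# Build one string key per row so groups can be matched across both tables.
group_key <- function(df) paste(df$Bage, df$Gender, df$Employment, sep = "|")

assign_ethnicity <- function(people, targets) {
  people_key  <- group_key(people)
  targets_key <- group_key(targets)

  for (k in unique(targets_key)) {
    # Rows of this group that have not been labelled yet.
    rows   <- which(people_key == k & people$Ethnicity == "")
    wanted <- targets[targets_key == k, ]
    # Expand the counts into one label per person to be assigned.
    labels <- rep(wanted$Ethnicity, times = wanted$n)
    stopifnot(length(labels) <= length(rows))
    # Choose rows at random without replacement; indexing with sample.int()
    # avoids the sample(x) surprise when only one row is available.
    chosen <- rows[sample.int(length(rows), length(labels))]
    people$Ethnicity[chosen] <- labels
  }
  people
}

set.seed(1)  # only to make the random assignment repeatable
result <- assign_ethnicity(people, targets)
table(result$Employment, result$Ethnicity)
```

With this layout, adding another age band, gender, employment type, or ethnicity would only mean adding rows to `targets`. Any rows in a group beyond the requested counts are left blank rather than guessed.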
  • I think it would be best if you could provide us with a small subset of your data (or preferably simulated dataset) and what the end result would look like. What you need to do is devise an algorithm, from there it's usually straightforward. – Roman Luštrik Nov 27 '15 at 10:42
  • @RomanLuštrik Thank you. I've edited the question to include a sample of the data. I'm fairly new to R so still learning; the lines of code I used do work but as I mentioned it is very time consuming and I have a whole population to do this for so finding a quicker way would be great! – lts Nov 27 '15 at 11:45
  • You're on a good path. See here how to make life easier for everyone: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Roman Luštrik Nov 28 '15 at 07:29
