This has probably been answered already but I honestly can't find it if it has.

I have tens of millions of observations of cost, along with a patient ID, gender, region and age category.

I need the average cost per person (ID), with one cost entry for each unique ID number, but sometimes the same individual pops up in more than one age category or region. I need a piece of code to assign the most representative age category and region to the data associated with each ID.

As an example, ID user 1 has a birthday during the year and moves from age category 15-24 to 25-34 but I'd like to assign 15-24 to that user because it best describes him. Again, ID user 3 pops up in regions A, D and E but I'd like to assign E to him because it's the most representative region for that user.

If someone could suggest code to assign the most common values in a user's rows to all of those rows I'd be grateful. I can take it from there myself.

Thanks so much

Here's a table of what I have and then a table of what I want. Notice that individual 1 was also entered as female for the final observation, so three changes were required in this sample. This happens numerous times in the data set.

ID  Cost    Sex Age Cat Region
1   5.68    M   15-24   A
1   5.63    M   15-24   A
1   5.87    M   15-24   A
1   6.32    M   25-34   A
1   6.45    F   25-34   B
2   2.34    F   65-74   C
2   3.52    F   65-74   C
2   9.85    F   65-74   C
3   7.52    M   35-44   A
3   6.52    M   35-44   D
3   5.32    M   35-44   E
3   2.12    M   35-44   E

ID  Cost    Sex Age Cat Region
1   5.68    M   15-24   A
1   5.63    M   15-24   A
1   5.87    M   15-24   A
1   6.32    M   15-24   A
1   6.45    M   15-24   A
2   2.34    F   65-74   C
2   3.52    F   65-74   C
2   9.85    F   65-74   C
3   7.52    M   35-44   E
3   6.52    M   35-44   E
3   5.32    M   35-44   E
3   2.12    M   35-44   E
steve
  • Please show a few lines of your dataset and the expected result based on that – akrun Apr 16 '15 at 15:18
  • Please insert the table data at the bottom of your question with `code` formatting. Don't worry if it looks broken, someone will reformat it. It would also be best to show what output you are expecting along with the sample input table data. – musically_ut Apr 16 '15 at 15:21

1 Answer

I think the general answer is that you want to use the split-apply-combine pattern with a function like `aggregate` or `tapply`. Here's a solution using `tapply`:

df <- data.frame(id = c(1,1,1,2,2,3),
                 f=factor(c('foo','foo','bar','foo','bar','bar')))
df
#>   id   f
#> 1  1 foo
#> 2  1 foo
#> 3  1 bar
#> 4  2 foo
#> 5  2 bar
#> 6  3 bar


# initialize a data.frame with one record per id
df2 <- data.frame(id = unique(df$id))

# identify the most frequent factor level for each person
tmp <- tapply(df$f, # the variable to be summarized
              df$id, # the variable on which to group the summaries
              function(x){
                  # tabulate the values of f
                  tbl <- table(x)
                  # return (the first of) the most frequent value(s)
                  names(tbl)[which(tbl == max(tbl))[1]]
              })
df2$f <- tmp[match(df2$id,names(tmp))]

df2
#>   id   f
#> 1  1 foo
#> 2  2 bar
#> 3  3 bar

Note that you have to be careful in breaking ties -- you'll notice the 'first value' returned in this example was the first in alphabetical order, since that's the ordering of the factor levels and hence the ordering in the table `tbl`.
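
To see the tie-breaking concretely, here is the id 2 case from the toy data above in isolation:

tbl <- table(factor(c('foo', 'bar')))
tbl
#> bar foo
#>   1   1
names(tbl)[which(tbl == max(tbl))[1]]
#> [1] "bar"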

`dplyr` and `data.table` solutions will run faster because they create indexes on the grouping variables and then use those indexes to identify related sets of records. That said, there are two problems that come to mind with a dataset of >1.6M records: running time and memory. You may want to benchmark the time to create the index by using a trivial summary function like `function(x) 0`; then, knowing that creating the index is order n*log(n), you can estimate the running time for the full set (provided you're not running out of memory).
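
For example, here's a rough sketch of that benchmark -- the synthetic `id` and `f` columns and the value of `N` below are just placeholders standing in for your real data:

# build a synthetic data set with the same shape as the toy example above
set.seed(1)
n   <- 1e6
big <- data.frame(id = sample(1e5, n, replace = TRUE),
                  f  = factor(sample(c('foo', 'bar', 'baz'), n, replace = TRUE)))

# time the grouping step alone by passing a trivial summary function
t_elapsed <- system.time(
    tapply(big$f, big$id, function(x) 0)
)['elapsed']

# assuming roughly n*log(n) scaling, extrapolate to a full data set of N rows
N <- 6e7   # replace with the number of rows in your full data set
t_elapsed * (N * log(N)) / (n * log(n))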

For this task, I suggest using `data.table`, with `setkey(DT, id)` to index the id column before aggregating the data. There are plenty of examples of this type of task with `data.table`, including the introductory vignette (see `vignette('datatable-intro')`).
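
For what it's worth, here's a sketch of how that might look with `data.table` on the sample data from the question; the `Mode` helper and the condensed column names (e.g. `AgeCat` for 'Age Cat') are my own choices rather than anything built into the package:

library(data.table)

# the sample data from the question
DT <- data.table(ID     = c(1,1,1,1,1, 2,2,2, 3,3,3,3),
                 Cost   = c(5.68,5.63,5.87,6.32,6.45, 2.34,3.52,9.85, 7.52,6.52,5.32,2.12),
                 Sex    = c('M','M','M','M','F', 'F','F','F', 'M','M','M','M'),
                 AgeCat = c('15-24','15-24','15-24','25-34','25-34',
                            '65-74','65-74','65-74',
                            '35-44','35-44','35-44','35-44'),
                 Region = c('A','A','A','A','B', 'C','C','C', 'A','D','E','E'))

# most frequent value within a group (ties go to the first level, as in the tapply version)
Mode <- function(x) names(which.max(table(x)))

setkey(DT, ID)   # index on ID so the grouped operations below are fast

# overwrite Sex, AgeCat and Region with each person's modal value
cols <- c('Sex', 'AgeCat', 'Region')
DT[, (cols) := lapply(.SD, Mode), by = ID, .SDcols = cols]

# the average cost per person is then a simple grouped mean
DT[, .(avg_cost = mean(Cost)), by = ID]

On this sample it reproduces the desired table above (person 1 stays male in 15-24, person 3 ends up in region E).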

Jthorpe
  • Thanks a million folks. I'll try this and if it doesn't suit for some reason I'll try to add the table using the code tab. – steve Apr 20 '15 at 08:38
  • I've tried the code Jthorpe. On a test sample it worked perfectly but on the full sample (60,000,000 obs) it's unfeasibly slow. I'll leave it running all night and see what happens. As you can see from the tables I've added, I need to do it three times also. Is there a more efficient way? – steve Apr 20 '15 at 16:16
  • solutions based on `dplyr` and (especially) `data.table` are generally faster for large data sets ... – Ben Bolker Apr 20 '15 at 16:36
  • I changed it to a data.table but to no avail. Also, Jthorpe's code works on a made-up sample but not on the actual data set. I'm quite new to R and am pretty badly stuck here if anyone has any suggestions. – steve Apr 21 '15 at 08:28
  • Ok, Jthorpe's code works perfectly in smaller samples of the actual dataset. Once I try to run it on the full data set (>1.6M obs of 8 variables) it runs and runs with no sign of completing. Could anyone suggest a more efficient way, or perhaps insert the code for parallel processing into Jthorpe's answer? Thanks again. – steve Apr 21 '15 at 10:59
  • See edits to my response regarding dealing with larger datasets. Also, Hadley Wickham's [Advanced R](http://adv-r.had.co.nz/) is a great resource for understanding how to deal with these types of problems, especially the sections on [performance](http://adv-r.had.co.nz/Performance.html) and [memory](http://adv-r.had.co.nz/memory.html). – Jthorpe Apr 21 '15 at 17:37
  • Thanks so much Jthorpe that's a huge help. Worst case scenario it will take five days to run but it would have taken twice that for me to figure it out on my own. By the way 1.6M obs was a misprint, it's 60M obs. Thanks again. – steve Apr 22 '15 at 08:42