This has probably been answered already but I honestly can't find it if it has.
I have tens of millions of observations of cost, a patient ID, gender, region and age cat.
I need average cost per person (ID) with one cost entry for each unique ID number but sometimes the same individual pops up in more than one age category or region. I need a piece of code to assign the most representative age and region to the data associated with each ID user.
As an example, ID user 1 has a birthday during the year and moves from age category 15-24 to 25-34 but I'd like to assign 15-24 to that user because it best describes him. Again, ID user 3 pops up in regions A, D and E but I'd like to assign E to him because it's the most representative region for that user.
If someone could suggest code to assign the most common values in a user's rows to all of those rows I'd be grateful. I can take it from there myself.
Thanks so much
Here's a table of what I have and then a table of what I want. Notice that individual 1 was also entered as female for the final observation so 'three' changes were required to this sample. This happens numerous times in the data set.
ID Cost Sex Age Cat Region
1 5.68 M 15-24 A
1 5.63 M 15-24 A
1 5.87 M 15-24 A
1 6.32 M 25-34 A
1 6.45 F 25-34 B
2 2.34 F 65-74 C
2 3.52 F 65-74 C
2 9.85 F 65-74 C
3 7.52 M 35-44 A
3 6.52 M 35-44 D
3 5.32 M 35-44 E
3 2.12 M 35-44 E
ID Cost Sex Age Cat Region
1 5.68 M 15-24 A
1 5.63 M 15-24 A
1 5.87 M 15-24 A
1 6.32 M 15-24 A
1 6.45 M 15-24 A
2 2.34 F 65-74 C
2 3.52 F 65-74 C
2 9.85 F 65-74 C
3 7.52 M 35-44 E
3 6.52 M 35-44 E
3 5.32 M 35-44 E
3 2.12 M 35-44 E