I have a data.table
of events recording, say, user ID, country of residence, and event.
E.g.,
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
country=c(rep(1,4),rep(2,6)),
event=1:10, key="user")
As you can see, the data is somewhat corrupted: event 5 reports user 3 as being in country 2 (or maybe he traveled - it does not matter to me here). So when I try to summarize the data:
dt[, country[.N] , by=user]
user V1
1: 3 2
2: 4 2
I get the wrong country for user 3. Ideally, I would like to get the most common country for a user and the percentage of time he spent there:
user country support
1: 3 1 0.8
2: 4 2 1.0
How do I do that?
The actual data has ~10^7 rows, so the solution has to scale (this is why I am using data.table
and not data.frame
after all).