
I have a data.table of events recording, say, user ID, country of residence, and event ID. E.g.,

library(data.table)

dt <- data.table(user = c(rep(3, 5), rep(4, 5)),
                 country = c(rep(1, 4), rep(2, 6)),
                 event = 1:10, key = "user")

As you can see, the data is somewhat corrupted: event 5 reports user 3 as being in country 2 (or maybe he traveled - it does not matter to me here). So when I try to summarize the data:

dt[, country[.N], by=user]
   user V1
1:    3  2
2:    4  2

I get the wrong country for user 3. Ideally, I would like to get the most common country for a user and the percentage of time he spent there:

   user country support
1:    3       1     0.8
2:    4       2     1.0

How do I do that?

The actual data has ~10^7 rows, so the solution has to scale (this is why I am using data.table and not data.frame after all).

– sds

2 Answers


Another way:

Edited: `table(.)` was the culprit; I've changed it to pure data.table syntax.
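For reference, here's a sketch of the kind of `table(.)`-based grouping that was slow; this is an assumed reconstruction, not necessarily the exact original:

dt[, {
    tab <- table(country)  # builds a named count vector for every group
    list(country = as.integer(names(tab)[which.max(tab)]),
         support = max(tab)/.N)
}, by = user]

Calling `table()` inside `j` allocates a named vector per user, which gets expensive with many groups. The pure data.table version below avoids that: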

# count events per (user, country), then pick each user's modal country
dt.out <- dt[, .N, by=list(user, country)][, list(country[which.max(N)],
                                                  max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"), c("country", "support"))
#    user country support
# 1:    3       1     0.8
# 2:    4       2     1.0
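To check scaling, here's a minimal timing sketch using the constructed data from the comments below (assumed setup; timings will vary by machine):

set.seed(1)
big <- data.table(user = rep(1:2e6, each = 5),
                  country = sample(1:3, 1e7, replace = TRUE))

system.time(
    big[, .N, by = list(user, country)][, list(country = country[which.max(N)],
                                               support = max(N)/sum(N)),
                                        by = user]
)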
– Arun
  • this is horribly slow (and generates the "named vector" performance note) – sds Apr 24 '13 at 20:21
  • Edited. `table` was the problem. Now it runs in under 1 second on my laptop on a 1e6-row data.table. For the future, it'd be useful to know the actual size of your problem. The `named vector` note is just a message and usually doesn't affect performance much at all; you can check it. I've removed the names here. It'd be nice *if you put in some effort to find the cause of the slowness* and reported back instead of just saying *horribly slow*. – Arun Apr 24 '13 at 22:25
  • For the future: you've framed your question not as *how do I optimise this?* but as *how do I do this?*. Also, you've not mentioned the size of your data, nor does your example show it. It's only normal to expect **a** solution that shows how to do it, not necessarily the most optimised one. Testing with realistic data sure helps write better code. – Arun Apr 25 '13 at 09:22
  • your solution is no faster than @eddi's, even though he uses a sort and you do not. I guess my groups are small enough for the log factor to be inconsequential. – sds Apr 25 '13 at 19:38
  • @sds, in my benchmark it turned out to be faster by 0.5 *seconds*, as I mention in the comments on his answer (on 1e6 rows). – Arun Apr 25 '13 at 19:40
  • @sds, In my constructed example `dt <- data.table(user = rep(1:1e5, each=100), country=sample(1:10, 1e7, replace=TRUE))`, my solution runs in under 2 secs. Can you benchmark on it? I'd like to know why the huge difference. How many users do you have and what's the maximum number of countries amongst all users? – Arun Apr 25 '13 at 19:42
  • On this data: `dt <- data.table(user = rep(1:2e6, each=5), country=sample(1:3, 1e7, replace=TRUE))`, it takes about 15 seconds. You say 1 minute... I'm wondering what's making this huge difference. Could you post the data somewhere? – Arun Apr 25 '13 at 19:45
  • yes, just under 16sec. it's just that my data is larger and not as uniform (and it is not public, sorry) -- thanks a lot for your help! – sds Apr 25 '13 at 19:52

Using plyr's count function:

library(plyr)  # for count()

dt[, count(country), by = user][order(-freq),
                                list(country = x[1],
                                     support = freq[1]/sum(freq)),
                                by = user]
#   user country support
#1:    4       2     1.0
#2:    3       1     0.8

The idea is to count the countries per user, order by descending frequency, and then take the first (most frequent) row for each user.

A smarter version, thanks to @mnel, that doesn't use any extra packages:

dt[, list(freq = .N),
     by = list(user, country)][order(-freq),
                               list(country = country[1],
                                    support = freq[1]/sum(freq)),
                               by = user]
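
This gives the same result (the `order(-freq)` in `i` sorts before the grouped `j` runs, so `country[1]` and `freq[1]` pick out each user's most frequent country):

#    user country support
# 1:    4       2     1.0
# 2:    3       1     0.8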
– eddi
  • is `count` from the `plyr` package? – Arun Apr 24 '13 at 20:01
  • hah, yes! :) I just typed `count(country)` without checking where it comes from in my environment since that's what I'd call the function that did this - and it worked :) – eddi Apr 24 '13 at 20:03
  • `count` is also very slow (like my old solution using `table`). Using `data.table` grouping (your second solution) is much faster. Probably better to remove the first solution altogether. Yours can be improved further (it runs in 1.55 secs against my 1 sec on my 1e6-row test) by looking at the optimisation messages. – Arun Apr 24 '13 at 22:29
  • I don't think you need to use `order` – sds Apr 25 '13 at 19:39
  • @sds, you do. Try this: `dt[order(user,-country)][, list(freq=.N), by = list(user, country)]` to see why – eddi Apr 25 '13 at 19:41
  • you only need max, not full order – sds Apr 25 '13 at 19:42
  • @sds ok, I guess you're saying that Arun's solution is more efficient, and I agree – eddi Apr 25 '13 at 19:45