Creating summary table using two separate datasets in data.table R

Question

Language: R

Package: data.table

I have two datasets: One where all the country names are listed (the key dataset), and one where I have some data about a population in which one field is Country of birth. And I want to make data.table of the number of people born in each country. Now there might be some countries where none of the participants were born. How can I create such a table using data.table commands?

Example:

CountryKey = data.table(Country = LETTERS[1:5])
DT = data.table(Person = c(1,2,3,4,5,6,7), Birth = LETTERS[c(1,1,2,3,2,1,3)])

DT
   Person Birth
1:      1     A
2:      2     A
3:      3     B
4:      4     C
5:      5     B
6:      6     A
7:      7     C

CountryKey
   Country
1:       A
2:       B
3:       C
4:       D
5:       E

DT[, j = .(.N), by = .(Birth)]
   Birth N
1:     A 3
2:     B 2
3:     C 2

Since there was noone born in D, and E they do not show up in the output. However I want to see them with a value of zero.

@Frank asnwered this question.

DT[CountryKey, on=c(Birth = "Country"), .N, by=.EACHI]
   Birth N
1:     A 3
2:     B 2
3:     C 2
4:     D 0
5:     E 0

However, the CountryKey file might not be unique in its rows. Let's say it is just some dataset similar to DT, with many rows of the same country. How can we do it then?

CountryKey = data.table(Country =c('A','B','B','C','C','D','E','D','D'))

   Country
1:       A
2:       B
3:       B
4:       C
5:       C
6:       D
7:       E
8:       D
9:       D

The above solution gives this result which is not what I want. I want to see the unique entries for each country only.

DT[CountryKey, on=c(Birth = "Country"), .N, by=.EACHI]
   Birth N
1:     A 3
2:     B 2
3:     B 2
4:     C 2
5:     C 2
6:     D 0
7:     E 0
8:     D 0
9:     D 0

@Frank's answer partially works:

DT[unique(CountryKey), on=c(Birth = "Country"), .N, by=.EACHI]
   Birth N
1:     A 3
2:     B 2
3:     C 2
4:     D 0
5:     E 0

However, if the CountryKey has several columns, it does not:

   CountryKey = data.table(Country =c('A','B','B','C','C','D','E','D','D'), 
Somevalue = sample(x = 3, size = 9, replace = T))

DT[unique(CountryKey), on=c(Birth = "Country"), .N, by=.EACHI]
   Birth N
1:     A 3
2:     B 2
3:     B 2
4:     C 2
5:     C 2
6:     D 0
7:     E 0
8:     D 0

Solution by @Frank:

DT[i = unique(CountryKey$Country), on="Birth", .N, by=.EACHI]
P.S. : You should have 1.9.7. Earlier versions will give an error.

Did you notice my earlier comment? http://stackoverflow.com/q/25869543/ and `DT[CountryKey, on=c(Birth = "Country"), .N, by=.EACHI]` — Frank, Apr 25 '16 at 17:33
Yes I did and was going to write that here, however you did not give me enough time. Thanks. — ilyas, Apr 25 '16 at 17:37
Ok, no problem. I'll mark it as a dupe then, but can undo that if necessary — Frank, Apr 25 '16 at 17:38
like a table? `data.frame(table(factor(DT$Birth, levels = CountryKey$Country)))` — rawr, Apr 25 '16 at 17:44
@Frank Please don't do. I don't think these are the same questions. — ilyas, Apr 25 '16 at 17:45
Ok. Not sure if this is what you're after but `DT[unique(CountryKey), on=c(Birth = "Country"), .N, by=.EACHI]` is one simple extension. — Frank, Apr 25 '16 at 17:47
Ok, try `DT[unique(CountryKey$Country), on="Birth", .N, by=.EACHI]`. The vignettes for the package are a good way to get used to how the syntax works: https://github.com/Rdatatable/data.table/wiki/Getting-started — Frank, Apr 25 '16 at 18:05

score 3 · Accepted Answer · answered Apr 25 '16 at 19:46

3

Here's one way:

DT[.(unique(CountryKey$Country)), .N, on="Birth", by=.EACHI]

The OP reports that this gives an error in data.table 1.9.6 (saying Birth is a factor). In that case, run DT[, Birth := as.character(Birth)] first.

answered Apr 25 '16 at 19:46

Frank

66,179
8
96
180

AOGSTA · Answer 2 · 2016-04-25T19:45:05.683

2

I think coding DT$Birth as a factor with levels being the unique values of CountryKey$Country is actually the most elegant way to approach this (conceptually and practically). The factor class was written exactly to handle situations like this. Also this works with data.tables as well as data.frames.

CountryKey = data.table(Country = LETTERS[1:5])
DT = data.table(Person = c(1,2,3,4,5,6,7), Birth = LETTERS[c(1,1,2,3,2,1,3)])

DT$Birth_factor<-as.factor(DT$Birth)
levels(DT$Birth_factor)<-unique(CountryKey$Country)
table(DT$Birth_factor)

The results of table()

> table(DT$Birth_factor)

A B C D E 
3 2 2 0 0

The data.table-specific analogue might look like:

DT[, Birth := factor(Birth, levels = union(Birth, CountryKey$Country))]
DT[levels(Birth), .N, on = "Birth", by=.EACHI]

edited Apr 25 '16 at 19:45

answered Apr 25 '16 at 17:38

AOGSTA

698
4
11

"best way" based on what metric exactly? And fix those `<-`'s and replace them with `:=`. – eddi Apr 25 '16 at 17:48
@Arman Thanks for the input. I know how to use table. I was trying to do this the data.table way. But I like your answer too. – ilyas Apr 25 '16 at 17:49
@Frank Did you try out your answer? The second one gives an error. – ilyas Apr 25 '16 at 17:57
@ilyas Hm, it works for me (data.table 1.9.7, R 3.2.4). If you mention the specific error maybe we can figure out how to fix it. Otherwise, I can revert my edit to this answer. – Frank Apr 25 '16 at 17:59
@Frank I have R 3.2.4 too. How can I check my package version? I get the following error: Error in bmerge(i, x, leftcols, rightcols, io, xo, roll, rollends, nomatch, : x.'Birth' is a factor column being joined to i.'NA' which is type 'NULL'. Factor columns must join to factor or character columns. – ilyas Apr 25 '16 at 18:06
1

@ilyas It might be more convenient to discuss in the R chat room: http://chat.stackoverflow.com/rooms/25312/r-public – Frank Apr 25 '16 at 18:09
@eddi By best, I mean "elegant." Country truly is a factor variable and therefore should be treated as such. Doing so allows us to solve this problem using the levels attribute of the factor class - we don't need anything more complicated than R's built-in object-orientated class/attribute structure. I guess it's all subjective. I'm sure OP has his reasons for using data.table. – AOGSTA Apr 25 '16 at 18:10
@ArmanOganisian Yes, this is all subjective, hence saying "*I think this is the best way*" instead of "*this is the best way*" is what you probably should say. And don't worry, [eddi has nothing against `data.table`](http://stackoverflow.com/help/badges/3755/data.table) – David Arenburg Apr 25 '16 at 19:37
Fair enough. I'll be more precise in my language going forward :) – AOGSTA Apr 25 '16 at 19:44

Creating summary table using two separate datasets in data.table R

2 Answers2