1

For a sample dataframe:

survey <- structure(list(id = 1:10, cntry = structure(c(2L, 3L, 1L, 2L, 
2L, 3L, 1L, 1L, 3L, 2L), .Label = c("DE", "FR", "UK"), class = "factor"), 
    age.cat = structure(c(1L, 1L, 2L, 4L, 1L, 3L, 4L, 4L, 1L, 
    2L), .Label = c("Y_15.24", "Y_40.54", "Y_55.plus", "Y_less.15"
    ), class = "factor")), .Names = c("id", "cntry", "age.cat"
), class = "data.frame", row.names = c(NA, -10L))

I want to add an extra column called 'age.cat' that is populated by another dataframe:

age.cat <- structure(list(cntry = structure(c(2L, 3L, 1L), .Label = c("DE", 
    "FR", "UK"), class = "factor"), Y_less.15 = c(0.2, 0.2, 0.3), 
        Y_15.24 = c(0.2, 0.1, 0.2), Y_25.39 = c(0.2, 0.3, 0.1), Y_40.54 = c(0.3, 
        0.2, 0.1), Y_55.plus = c(0.1, 0.2, 0.3)), .Names = c("cntry", 
    "Y_less.15", "Y_15.24", "Y_25.39", "Y_40.54", "Y_55.plus"), class = "data.frame", row.names = c(NA, 
    -3L))

The age.cat dataframe lists proportions of people in the three countries by the different age categories. The corresponding country/age category needs to be added as an additional column in the survey dataframe. Previously, when I used a single country for example, I use merge, but this wouldn't work here as I understand as I need matching on a column and row.

Does anyone have any ideas?

agenis
  • 8,069
  • 5
  • 53
  • 102
KT_1
  • 8,194
  • 15
  • 56
  • 68

4 Answers4

5

Using data.table, I'd do this directly as follows:

require(data.table) # v1.9.6+
dt1[dt2, ratio := unlist(mget(age.cat)), by=.EACHI, on="cntry"]

where,

dt1 = as.data.table(survey)[, age.cat := as.character(age.cat)]
dt2 = as.data.table(age.cat)

For each row in dt2, the matching rows in dt1$cntry are found corresponding to dt2$cntry (it helps to think of it like a subset operation by matching on cntry column). age.cat values for those matching rows are extracted and passed to mget() function, that looks for variables named with the values in age.cat, and finds it in dt2 (we allow for columns in dt2 to be also visible for exactly this purpose), and extracts the corresponding values. Since it returns a list, we unlist it. Those values are assigned to the column ratio by reference.

Since this avoids unnecessary materialising of intermediate data by melting/gathering, it is quite efficient. Additionally, since it adds a new column by reference while joining, it avoids another intermediate materialisation and is doubly efficient.

Personally, I find the code much more straightforward to understand as to what's going on (with sufficient base R knowledge of course), but that is of course subjective.


Slightly more detailed explanation:

The general form of data.table syntax is DT[i, j, by] which reads:

Take DT, subset rows by i, then compute j grouped by by.

The i argument in data.table, in addition to being subset operations e.g., dt1[cntry == "FR"], can also be another data.table.

Consider the expression: dt1[dt2, on="cntry"].

The first thing it does is to compute, for each row in dt2, all matching row indices in dt1 by matching on the column provided in on = "cntry". For example, for dt2$cntry == "FR", the matching row indices in dt1 are c(1,4,5,10). These row indices are internally computed using fast binary search.

Once the matching row indices are computed it looks as to whether an expression is provided in the j argument. In the above expression j is empty. Therefore it returns all the columns from both dt1 and dt2 (leading to a right join).

In other words, data.table allows join operations to be performed in a similar fashion to subsets (because in both operations, the purpose of i argument is to obtain matching rows). For example, dt1[cntry == "FR"] would first compute the matching row indices, and then extract all columns for those rows (since no columns are provided in the j argument). This has several advantages. For example, if we would only like to return a subset of columns, then we can do, for example:

dt1[dt2, .(cntry, Y_less.15), on="cntry"]

This is efficient because we look at the j expression and notice that only those two columns are required. Therefore on the computed row indices, we only extract the required columns thereby avoiding unnecessary materialisation of all the other columns. Hence efficient

Also, just like how we can select columns, we can also compute on columns. For example, what if you'd like to get sum(Y_less.15)?

dt1[dt2, sum(Y_less.15), on="cntry"]
# [1] 2.3

This is great, but it computes the sum on all the matching rows. What if you'd like to get the sum for each row in dt2$cntry? This is where by = .EACHI comes in.

dt1[dt2, sum(Y_less.15), on="cntry", by=.EACHI]
#    cntry  V1
# 1:    FR 0.2
# 2:    UK 0.2
# 3:    DE 0.3

by=.EACHI ensures that the j expression is evaluated for each row in i = dt2.

Similarly, we can also add/update columns while joining using the := operator. And that's the answer shown above. The only tricky part there is to extract the values for those matching rows from dt2, since they are stored in separate columns. Hence we use mget(). And the expression unlist(mget(.)) gets evaluated for each row in dt2 while matching on "cntry" column. And the corresponding values are assigned to ratio by using the := operator.

For more details on history of := operator see this, this and this post on SO.

For more on by=.EACHI, see this post.

For more on data.table syntax introduction and reference semantics, see the vignettes.

Hope this helps.

Community
  • 1
  • 1
Arun
  • 116,683
  • 26
  • 284
  • 387
  • does := operation matches between each element ? %in% can also be used ? –  May 25 '16 at 13:06
  • @Learner, added a detailed explanation. – Arun May 25 '16 at 13:28
  • thanks but it is too long and made me confused even more! however, thanks for putting your effort to explain what := does and I liked you answer but made me more confused :-D –  May 25 '16 at 13:46
  • 1
    It's part of learning new concepts / ways of doing things. It's not necessarily a bad thing. I've provided additional link on `:=` to read upon if you're still curious. – Arun May 25 '16 at 13:48
  • I started learning data.table thanks to you , my first answer I used this package http://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame/37443836#37443836 –  May 25 '16 at 17:23
1

We can do a join after melting the second dataset to 'long' format

library(data.table) #v1.9.7
melt(setDT(age.cat), id.var="cntry")[survey, on = c("cntry",  "variable" = "age.cat")]
#    cntry  variable value id
# 1:    FR   Y_15.24   0.2  1
# 2:    UK   Y_15.24   0.1  2
# 3:    DE   Y_40.54   0.1  3
# 4:    FR Y_less.15   0.2  4
# 5:    FR   Y_15.24   0.2  5
# 6:    UK Y_55.plus   0.2  6
# 7:    DE Y_less.15   0.3  7
# 8:    DE Y_less.15   0.3  8
# 9:    UK   Y_15.24   0.1  9
#10:    FR   Y_40.54   0.3 10

If we are using the CRAN version i.e. data.table_1.9.6,

melt(setDT(age.cat), id.var="cntry", variable.name = "age.cat")[survey, 
                        on = c("cntry",  "age.cat")]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • @agenis Are you using the new version of data.table – akrun May 25 '16 at 12:26
  • @akrun I also got an error saying this Error in forderv(x, by = rightcols) : 'by' value -2147483648 out of range [1,3] –  May 25 '16 at 12:29
  • Perhaps with the version difference? I am using `data.table_1.9.7`, but it should have worked with the CRAN version – akrun May 25 '16 at 12:30
  • @akrun > packageVersion("data.table") [1] ‘1.9.6’ –  May 25 '16 at 12:31
  • @Learner Updated the post. I guess the option of using different column names with `on` is only available in the devel version – akrun May 25 '16 at 12:34
  • @akrun now works, one question , here you used the setDT(age.cat) to convert the age.cat to data.table or you did it for something else ? –  May 25 '16 at 12:44
  • @Learner Yes, I used that to convert to 'data.table' – akrun May 25 '16 at 12:46
  • 1
    @akrun I liked your answer then this could give the answer melt(setDT(age.cat), id.var="cntry") why did you add [survey, on = c("cntry", "variable" = "age.cat")] –  May 25 '16 at 12:53
  • @Learner It is because the `melt` output by default gives the `variable` as the column name which is different from the column name in the second dataset i.e. `age.cat` – akrun May 26 '16 at 02:06
1

You can turn age.cat into long format and then use join as follows:

library(dplyr)
library(tidyr)
age.cat <- gather(age.cat, age.cat, proportion, -cntry)
inner_join(survey, age.cat)
Gopala
  • 10,363
  • 7
  • 45
  • 77
0

you can do this, using packages reshape2 and dplyr:

age.cat %>% melt(variable.name="age.cat") %>% left_join(survey, .)
#### id cntry   age.cat value
#### 1  1    FR   Y_15.24   0.2
#### 2  2    UK   Y_15.24   0.1
#### 3  3    DE   Y_40.54   0.1

Is that what you want?

agenis
  • 8,069
  • 5
  • 53
  • 102