Creating compound/interacted dummy variables in data.table in R

Question

Still learning this awesome package data.table. I am working on the following data.table:

demo <- data.table(id = c(1, 2, 3, 4, 5, 6), sex = c(1, 2, 1, 2, 2, 2), agef = c(43, 53, 63, 73, 83, 103))

demo: 
id sex agef
 1  1   43
 2  2   53
 3  1   63
 4  2   73
 5  2   83
 6  2   103

I am trying to generate new columns (age_gender bands) as ("F0_34","F35_44","F45_54","F55_59"........"F95_GT") and ("M0_34","M35_44","M45_54","M55_59"........"M95_GT") based on the value of column sex and agef their names and value will be generated. I am able to do in a simple way :

demo <- demo[ ,F0_34:= {ifelse((sex==2) & (agef >= 0) & (agef <= 34), 1, 0)}]

But I was looking for an elegant solution for this and I tried to pass age_band as a list in lapply function, as following:

i <- list("0_34","35_44","45_54","55_59","60_64","65_69","70_74","75_79","80_84","85_89","90_94","95_GT") 

demo[, paste0("F", i) := lapply(i, function(i)lapply(.SD, function(x){
l1 <- unlist(str_split(i, "_"))
if(l1[2] == "GT") l1[2] <- 1000
l1 <- as.numeric(l1)
score <- ifelse((sex==2) & (agef >= l1[1]) & (agef <= l1[2]), 1, 0)
return(score)  
})), .SDcols = c("sex", "agef"), by = id]

demo[, paste0("M", i) := lapply(i, function(i)lapply(.SD, function(x){
l1 <- unlist(str_split(i, "_"))
if(l1[2] == "GT") l1[2] <- 1000
l1 <- as.numeric(l1)
score <- ifelse((sex==1) & (agef >= l1[1]) & (agef <= l1[2]), 1, 0)
return(score)  
})), .SDcols = c("sex", "agef"), by = id]

I am getting the desired output:

id  sex agef    F0_34   F35_44  F45_54  F55_59  F60_64  F65_69  F70_74  F75_79  F80_84  F85_89  F90_94  F95_GT  M0_34   M35_44  M45_54  M55_59  M60_64  M65_69  M70_74  M75_79  M80_84  M85_89  M90_94  M95_GT
1   1   43      0       0       0       0       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0
2   2   53      0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
3   1   63      0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0
4   2   73      0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
5   2   83      0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
6   2   103     0       0       0       0       0       0       0       0       0       0       0       1       0       0       0       0       0       0       0       0       0       0       0       0

but with some warnings :

Warning messages:
1: In `[.data.table`(demographic1, , `:=`(paste0("F", i),  ... :
RHS 1 is length 2 (greater than the size (1) of group 1). The last 1    element(s) will be discarded.

which I am not able to understand, could someone point out what I am doing wrong?

I will run the same code for columns starting with "M". I will edit the code. — nsDataSci, Jun 05 '15 at 18:04
OP read Hadley's take on mixing sex and age in the same column http://vita.had.co.nz/papers/tidy-data.pdf — Pierre L, Jun 05 '15 at 18:27
I don't think mixing categorical vars is to be entirely forbidden, but I'll also have a look at that. @nsDataSci I'd suggest a different title. Your column names are independent of the data in the table and instead determined by the cutpoints you have chosen yourself. How about "Creating compound/interacted dummy variables in data.table"? There is a simpler version of this question with a title like that: http://stackoverflow.com/questions/18881073/creating-dummy-variables-in-r-data-table — Frank, Jun 05 '15 at 18:30
@plafort : thanks I will take a look at that... but you can't avoid that situation when you have secondary value based on that mixing like relative coefficients for that band. — nsDataSci, Jun 05 '15 at 18:38
@frank : Yeah absolutely make sense.. I will change the title to make it more clear.. Thanks — nsDataSci, Jun 05 '15 at 18:40

Pierre L · Accepted Answer · 2015-06-05T20:01:21.360

3

Is this what you're looking for:

age.brackets <- c(0,seq(35,55, by=10), seq(60,95, by=5), Inf) #age ranges
ranges <- (cut(demo$agef, age.brackets))
split(demo, demo$sex)
spread <- table(demo$agef, ranges) #identify persons in each range
male.spread <- (demo$sex=='1')*as.matrix(spread)
female.spread <- (demo$sex=='2')*as.matrix(spread)

newdt <- data.table(
  cbind(
    demo,
    matrix(as.vector(male.spread), ncol=ncol(male.spread)),
    matrix(as.vector(female.spread), ncol=ncol(female.spread))
    )
)


    #column names
names(newdt) <- c(names(demo), 
                  levels(cut(demo$agef, age.brackets)),
                  levels(cut(demo$agef, age.brackets))
                  )
female.names <- gsub('.(\\d*),(\\d*|Inf).', 'F\\1_\\2', levels(cut(demo$agef, age.brackets)))        
male.names <- gsub('.(\\d*),(\\d*|Inf).', 'M\\1_\\2', levels(cut(demo$agef, age.brackets)))
names(newdt) <- c(names(demo), female.names, male.names)


newdt

#    id sex agef F0_35 F35_45 F45_55 F55_60 F60_65 F65_70 F70_75 F75_80 F80_85 F85_90
# 1:  1   1   43     0      1      0      0      0      0      0      0      0      0
# 2:  2   2   53     0      0      0      0      0      0      0      0      0      0
# 3:  3   1   63     0      0      0      0      1      0      0      0      0      0
# 4:  4   2   73     0      0      0      0      0      0      0      0      0      0
# 5:  5   2   83     0      0      0      0      0      0      0      0      0      0
# 6:  6   2  103     0      0      0      0      0      0      0      0      0      0
#    F90_95 F95_Inf M0_35 M35_45 M45_55 M55_60 M60_65 M65_70 M70_75 M75_80 M80_85 M85_90
# 1:      0       0     0      0      0      0      0      0      0      0      0      0
# 2:      0       0     0      0      1      0      0      0      0      0      0      0
# 3:      0       0     0      0      0      0      0      0      0      0      0      0
# 4:      0       0     0      0      0      0      0      0      1      0      0      0
# 5:      0       0     0      0      0      0      0      0      0      0      1      0
# 6:      0       0     0      0      0      0      0      0      0      0      0      0
#    M90_95 M95_Inf
# 1:      0       0
# 2:      0       0
# 3:      0       0
# 4:      0       0
# 5:      0       0
# 6:      0       1

edited Jun 05 '15 at 20:01

answered Jun 05 '15 at 16:52

Pierre L

28,203
6
47
69

I'm not sure I follow how this solution works. It seems to rely on (1) at most one person being in each band (since `table` would other wise report numbers above 1) and (2) people being sorted by age (so that cbind lines up ids correctly)...? – Frank Jun 05 '15 at 17:54
@Frank more than one person can be in each age group. A value more than one will show up if someone is two ages at the same time. And `spread` inherits its order from `demo$agef` which has the same order as `demo` no sorting necessary. – Pierre L Jun 05 '15 at 18:04
My issue with the solution is that it combines male and female classifications, OP wants males and females on two separate tables. I've been working on a project, so I haven't been able to update yet. – Pierre L Jun 05 '15 at 18:07
I think the solution will work well if i could create a male `spread` and a female one. Quick and easy. Any suggestions? @Frank – Pierre L Jun 05 '15 at 18:15
Not sure, I'm thinking something like `cbind(demo, F = (demo$agef==2)*as.matrix(spread), M= (demo$agef==1)*as.matrix(spread))`, but that's not quite right. – Frank Jun 05 '15 at 18:21
@Frank OP should also realize that mixing ages and sex in the same columns is bad news. Age metrics by sex are easily accomplished with a tidy data set. – Pierre L Jun 05 '15 at 18:26
Thanks Guys! apart from above issues for my output i require exact column names like F0_34 F35_44.... as later I need to multiply these data.table columns based on their colnames – nsDataSci Jun 05 '15 at 18:27
Yes, I agree, from a data-tidyness perspective. It may be for passing the data to a regression command that doesn't permit interaction terms in its formula argument, I suppose. – Frank Jun 05 '15 at 18:28
2

OP I figured out a way to extend the columns for male and female. Thank you @Frank for the idea, it worked with a slight edit. instead of `demo$agef` I mulitiplied `demo$sex`. – Pierre L Jun 05 '15 at 19:58

MichaelChirico · Answer 2 · 2015-11-20T18:57:53.780

This should work and is more data.table-y:

cut_points <- c(0, seq(35, 55, by = 10), seq(60, 95, by = 5),Inf)
new_names_m <- paste0("M", cut_points[1:12], "_", c(cut_points[2:12], "GT"))
new_names_f <- paste0("F", cut_points[1:12], "_", c(cut_points[2:12], "GT"))
demo[sex == 1, ranges := cut(agef, cut_points, include.lowest = TRUE,
                        labels = new_names_m)]
demo[sex == 2, ranges := cut(agef, cut_points, include.lowest = TRUE,
                        labels = new_names_f)]
demo[ ,(c(new_names_m, new_names_f)) :=
       lapply(c(new_names_m, new_names_f), function(x) +(ranges == x))]
demo[ , ranges := NULL]

> demo
   id sex agef M0_35 M35_45 M45_55 M55_60 M60_65 M65_70 M70_75 M75_80 M80_85 M85_90 M90_95 M95_GT F0_35 F35_45 F45_55 F55_60 F60_65
1:  1   1   43     0      1      0      0      0      0      0      0      0      0      0      0     0      0      0      0      0
2:  2   2   53     0      0      0      0      0      0      0      0      0      0      0      0     0      0      1      0      0
3:  3   1   63     0      0      0      0      1      0      0      0      0      0      0      0     0      0      0      0      0
4:  4   2   73     0      0      0      0      0      0      0      0      0      0      0      0     0      0      0      0      0
5:  5   2   83     0      0      0      0      0      0      0      0      0      0      0      0     0      0      0      0      0
6:  6   2  103     0      0      0      0      0      0      0      0      0      0      0      0     0      0      0      0      0
   F65_70 F70_75 F75_80 F80_85 F85_90 F90_95 F95_GT
1:      0      0      0      0      0      0      0
2:      0      0      0      0      0      0      0
3:      0      0      0      0      0      0      0
4:      0      1      0      0      0      0      0
5:      0      0      0      1      0      0      0
6:      0      0      0      0      0      0      1

Alternately, instead of the lapply in the second-to-last line, one could initialize the dummies to zero and then assign ones in the appropriate positions:

new_names = c(new_names_f, new_names_m)
demo[ , (new_names) := 0L]
is = which(demo$ranges != "")   
js = 3L + match(demo$ranges[is], new_names) 
for (iter in seq_along(is)) set(demo, i = is[iter], j = js[iter], value = 1L)

I've edited in a way I think may be better than all of the `==` scanning required by the `lapply`. Also, I think there is no need to keep the first part of the answer (without the "M" columns), since the OP will edit or has edited to make it obsolete. — Frank, Jun 05 '15 at 18:09
Fair enough. Though mine's much more readable, it's for sure going to be slower. — MichaelChirico, Jun 05 '15 at 18:11

Creating compound/interacted dummy variables in data.table in R

2 Answers2