I have data that comes to me with many similar variables, with an additional variable which indicates which one of those similar variables I really want. Using a loop I can look up the correct value, but the data is large, the loop is slow, and it seems like this should be vectorizable. I just haven't figured out how.

EDIT: The selected variable will be used as a new variable in the same data frame, so order matters. There are many other variables not shown in the example given below.

Example data set:

set.seed(0)
df <- data.frame(yr1 = sample(1000:1100, 8),
                 yr2 = sample(2000:2100, 8),
                 yr3 = sample(3000:3100, 8),
                 yr4 = sample(4000:4100, 8),
                 var = paste0("yr", sample(1:4, 8, replace = TRUE)))
# df
# 
#    yr1  yr2  yr3  yr4 var
# 1 1090 2066 3050 4012 yr3
# 2 1026 2062 3071 4026 yr2
# 3 1036 2006 3098 4038 yr1
# 4 1056 2020 3037 4001 yr4
# 5 1088 2017 3075 4037 yr3
# 6 1019 2065 3089 4083 yr4
# 7 1085 2036 3020 4032 yr1
# 8 1096 2072 3061 4045 yr3

This loop method does the trick, but is slow and awkward:

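ycode <- numeric(nrow(df))  # numeric, so the looked-up values are not coerced to strings
for (i in seq_len(nrow(df))) {
  # as.character() guards against var being stored as a factor
  ycode[i] <- df[i, as.character(df$var[i])]
}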
df$ycode <- ycode

# df
#    yr1  yr2  yr3  yr4 var ycode
# 1 1090 2066 3050 4012 yr3  3050
# 2 1026 2062 3071 4026 yr2  2062
# 3 1036 2006 3098 4038 yr1  1036
# 4 1056 2020 3037 4001 yr4  4001
# 5 1088 2017 3075 4037 yr3  3075
# 6 1019 2065 3089 4083 yr4  4083
# 7 1085 2036 3020 4032 yr1  1085
# 8 1096 2072 3061 4045 yr3  3061 

It seems like I should be able to vectorize this, like so:

df$ycode <- df[, df$var]

But I find the result surprising:

#    yr1  yr2  yr3  yr4 var ycode.yr3 ycode.yr2 ycode.yr1 ycode.yr4 ycode.yr3.1 ycode.yr4.1 ycode.yr1.1 ycode.yr3.2
# 1 1090 2066 3050 4012 yr3      3050      2066      1090      4012        3050        4012        1090        3050
# 2 1026 2062 3071 4026 yr2      3071      2062      1026      4026        3071        4026        1026        3071
# 3 1036 2006 3098 4038 yr1      3098      2006      1036      4038        3098        4038        1036        3098
# 4 1056 2020 3037 4001 yr4      3037      2020      1056      4001        3037        4001        1056        3037
# 5 1088 2017 3075 4037 yr3      3075      2017      1088      4037        3075        4037        1088        3075
# 6 1019 2065 3089 4083 yr4      3089      2065      1019      4083        3089        4083        1019        3089
# 7 1085 2036 3020 4032 yr1      3020      2036      1085      4032        3020        4032        1085        3020
# 8 1096 2072 3061 4045 yr3      3061      2072      1096      4045        3061        4045        1096        3061

I also tried numerous variations on *apply, but none of those even came close. Some attempts:

> apply(df, 1, function(x) x[x$var])
Error in x$var : $ operator is invalid for atomic vectors
> apply(df, 1, function(x) x[x[var]])
Error in x[var] : invalid subscript type 'closure'

Any ideas? Many thanks.

Brian Stamper
  • As noted below, the solution is to use `match` with `names` to get a vector of indices, and then use indexing to get the values out of the data frame. In the example above, since the variables are in columns 1:4, the solution looks like: `df$ycode <- df[1:4][cbind(1:nrow(df), match(df$var, names(df)[1:4]))]` – Brian Stamper Feb 08 '16 at 21:30
  • I later realized that you don't even need to specify the columns, especially helpful if your variables are spread throughout your data. So just removing the `[1:4]` parts from above: `df$ycode <- df[cbind(1:nrow(df), match(df$var, names(df)))]`. – Brian Stamper Feb 10 '16 at 15:19

4 Answers


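We can use row/column (matrix) indexing: build a two-column matrix of (row, column) positions and use it to subset the candidate columns. This should be much faster than the loop.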

 df[-ncol(df)][cbind(1:nrow(df),match(df$var,head(names(df),-1)))]
 #[1] 3050 2062 1036 4001 3075 4083 1085 3061
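
To make the one-liner above easier to follow, here is a minimal sketch that breaks it into steps. The data frame is typed in from the printout in the question (so it does not depend on the random seed or R version), and the names `year_cols`, `col_idx` and `idx` are only illustrative, not part of the original answer.

```r
# Example data copied from the printout in the question
df <- data.frame(yr1 = c(1090, 1026, 1036, 1056, 1088, 1019, 1085, 1096),
                 yr2 = c(2066, 2062, 2006, 2020, 2017, 2065, 2036, 2072),
                 yr3 = c(3050, 3071, 3098, 3037, 3075, 3089, 3020, 3061),
                 yr4 = c(4012, 4026, 4038, 4001, 4037, 4083, 4032, 4045),
                 var = c("yr3", "yr2", "yr1", "yr4", "yr3", "yr4", "yr1", "yr3"),
                 stringsAsFactors = FALSE)

year_cols <- c("yr1", "yr2", "yr3", "yr4")      # candidate columns (illustrative name)
col_idx   <- match(df$var, year_cols)           # column position wanted in each row
idx       <- cbind(seq_len(nrow(df)), col_idx)  # one (row, column) pair per row

# Indexing with a two-column numeric matrix returns one value per (row, column) pair
df$ycode <- df[year_cols][idx]
df$ycode
# [1] 3050 2062 1036 4001 3075 4083 1085 3061
```

Because the pairs are built in row order, the result lines up with the original rows and can be assigned straight back as a new column.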

For variety, here is a data.table solution (it will likely be slower than the indexing above, because the expression is evaluated once per row). Convert the 'data.frame' to a 'data.table' with setDT(df), group by row number, and use get() to fetch the column named in 'var', converting 'var' to character first in case it is a factor.

library(data.table)
setDT(df)[, ycode := get(as.character(var)) , 1:nrow(df)]
df
#    yr1  yr2  yr3  yr4 var ycode
#1: 1090 2066 3050 4012 yr3  3050
#2: 1026 2062 3071 4026 yr2  2062
#3: 1036 2006 3098 4038 yr1  1036
#4: 1056 2020 3037 4001 yr4  4001
#5: 1088 2017 3075 4037 yr3  3075
#6: 1019 2065 3089 4083 yr4  4083
#7: 1085 2036 3020 4032 yr1  1085
#8: 1096 2072 3061 4045 yr3  3061
akrun
  • Indexing is indeed the way. Parts of this answer are almost _too_ clever: it took me a minute to get the -ncol(df) and head(names(df),-1) bits, because on the real data these variables aren't just the first n-1 variables, and there are other variables not mentioned. If the variables are in columns 12:40, say, the answer would then be written as: `df$ycode <- df[12:40][cbind(1:nrow(df), match(df$var, names(df)[12:40]))]` – Brian Stamper Feb 08 '16 at 21:16

I like the syntax of dplyr and tidyr:

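library(dplyr)
library(tidyr)

df$ID <- seq_len(nrow(df))  # row ID so the original order can be restored

df %>%
    gather(year, value, yr1:yr4) %>%  # long format: one row per (ID, year)
    filter(var == year) %>%           # keep only the year named in var
    select(-year) %>%                 # the chosen value is now in `value`
    arrange(ID)                       # back to the original row order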
CPhil
  • The results of this are sorted in a different order than the original data, so I cannot match it up. Edited my question to note that I need to maintain order, because there are many other variables that need to be matched with the result. – Brian Stamper Feb 08 '16 at 20:56
  • If there is value in the ordering, then you should store this as an ID column. If you have done so, then you can take the output of the above and add ` %>% spread( year, value) %>% arrange(ID)`. See edited answer – CPhil Feb 08 '16 at 21:00

While working on a data.table solution I came across this answer from @josliber (https://stackoverflow.com/a/30279903/4606130), and the approach seems fast:

df[cbind(seq(df$var), df$var)]

[1] "3050" "2062" "1036" "4001" "3075" "4083" "1085" "3061"
micstr
  • This does work, but interesting to note that the results have been coerced to strings. I also ran some timings comparing this to the indexing method in the answer from @akrun, and it turns out that indexing is faster, especially on larger data. Thanks! – Brian Stamper Feb 08 '16 at 21:01
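
The string coercion mentioned in the comment above comes from the character column `var`: matrix-style indexing converts the data frame to a matrix first, and a matrix can hold only one type. The following is a small sketch, again using data typed in from the question's printout, of how restricting the lookup to the numeric columns keeps the result numeric. The names `rows`, `num_cols`, `as_text` and `as_num` are illustrative only.

```r
# Example data copied from the printout in the question
df <- data.frame(yr1 = c(1090, 1026, 1036, 1056, 1088, 1019, 1085, 1096),
                 yr2 = c(2066, 2062, 2006, 2020, 2017, 2065, 2036, 2072),
                 yr3 = c(3050, 3071, 3098, 3037, 3075, 3089, 3020, 3061),
                 yr4 = c(4012, 4026, 4038, 4001, 4037, 4083, 4032, 4045),
                 var = c("yr3", "yr2", "yr1", "yr4", "yr3", "yr4", "yr1", "yr3"),
                 stringsAsFactors = FALSE)

rows <- seq_len(nrow(df))

# Whole data frame: the character column forces the intermediate matrix to character
as_text <- df[cbind(rows, match(df$var, names(df)))]
is.character(as_text)  # TRUE

# Numeric columns only: the intermediate matrix stays numeric
num_cols <- c("yr1", "yr2", "yr3", "yr4")
as_num <- df[num_cols][cbind(rows, match(df$var, num_cols))]
is.numeric(as_num)     # TRUE
```

If the candidate columns are scattered through the data, selecting them by name as above avoids both the coercion and the need to know their positions.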

One more vectorized option is a nested ifelse(). Its benefit, at least in my opinion, is that it is relatively readable compared to the other solutions. The obvious downside is that it does not scale as the number of variables grows.

ifelse(df$var == "yr1", df$yr1,
  ifelse(df$var == "yr2", df$yr2,
  ifelse(df$var == "yr3", df$yr3,
  ifelse(df$var == "yr4", df$yr4, NA))))

[1] 3050 2062 1036 4001 3075 4083 1085 3061
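
If the number of columns makes nesting impractical, one possible variation (a sketch under assumptions, not part of the original answer) is to loop over the column names instead of the rows, so each iteration is still a vectorized copy. The data is typed in from the question's printout, and `year_cols` and `hit` are illustrative names.

```r
# Example data copied from the printout in the question
df <- data.frame(yr1 = c(1090, 1026, 1036, 1056, 1088, 1019, 1085, 1096),
                 yr2 = c(2066, 2062, 2006, 2020, 2017, 2065, 2036, 2072),
                 yr3 = c(3050, 3071, 3098, 3037, 3075, 3089, 3020, 3061),
                 yr4 = c(4012, 4026, 4038, 4001, 4037, 4083, 4032, 4045),
                 var = c("yr3", "yr2", "yr1", "yr4", "yr3", "yr4", "yr1", "yr3"),
                 stringsAsFactors = FALSE)

year_cols <- c("yr1", "yr2", "yr3", "yr4")  # candidate columns (illustrative name)
ycode <- rep(NA_real_, nrow(df))            # stays NA where var names no known column

for (nm in year_cols) {
  hit <- which(df$var == nm)                # rows that ask for this column
  ycode[hit] <- df[[nm]][hit]               # vectorized copy for those rows
}
ycode
# [1] 3050 2062 1036 4001 3075 4083 1085 3061
```

The loop runs once per candidate column rather than once per row, so it stays cheap when there are many rows and comparatively few columns.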
Ilari Scheinin