
I've just transitioned to using R from SAS and I'm working with a very large data set (half a million observations and 20 thousand variables) that needs quite a bit of recoding. I imagine this is a pretty basic question, but I'm still learning so I'd really appreciate any guidance!

Many of the variables have three instances and each instance has multiple arrays. For this problem, I am using the "History of Father's Illness." There are many illnesses included, but I am primarily interested in CAD (coded as "1").

An example of how the data looks:

n_20107_0_0   n_20107_0_1   n_20107_0_2
    NA            NA            NA
     7             1             8
     4             6             1

I've only included 3 arrays here, but in reality there are close to 20. I did a bit of research and determined that the most efficient way to do this would be to create a list with the variables and then use lapply. This is what I have attempted:

FatherDisease1 <- paste("n_20107_0_", 0:3, sep = "")
lapply(FatherDisease1, transform, FatherCAD_0_0 = ifelse(FatherDisease1 == 1, 1, 0))

I don't quite get the results I am looking for when I do this.

n_20107_0_0   n_20107_0_1   n_20107_0_2   FatherCAD_0_0
    NA            NA            NA             0
     7             1             8             0
     4             6             1             0

What I would like to do is go through all 3 and, if the person answered 1 in any of them, set "FatherCAD_0_0" to 1; otherwise "FatherCAD_0_0" should equal 0. However, I only ever end up with 0's. As for the NA's, I would like them to stay as NAs. This is what I would like it to look like:

n_20107_0_0   n_20107_0_1   n_20107_0_2   FatherCAD_0_0
    NA            NA            NA            NA
     7             1             8             1
     4             6             1             1

I've figured out how to do this the "long" way (30+ lines of code -_-) but am trying to get better at writing more elegant and efficient code. Any help would be greatly appreciated!!

user7777508
  • You'll get a much better response if you follow these guidelines: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – boshek Apr 28 '17 at 21:48
  • To expand on boshek's comment, it will help somewhat to be a little more specific and clarify some terms: "array" is something rather specific in R, and I doubt that's really what you mean. Is `FatherDisease1` a raw list or a data frame? You can share readable versions of the example data via copy+paste-ing the output from `dput()`. – joran Apr 28 '17 at 21:53
  • `df$FatherCAD_0_0 <- as.integer(rowSums(df == 1) > 0)` – alistaire Apr 28 '17 at 22:00

1 Answer

Assuming your data is in a data.frame, you could use apply to loop over each row and check whether any of the columns you are interested in contains a 1:

FatherDisease1 <- paste("n_20107_0_", 0:2, sep = "")
df$FatherCAD_0_0 <- apply(df, 1, function(x) as.integer(any(x[FatherDisease1] == 1)))

df
#  n_20107_0_0 n_20107_0_1 n_20107_0_2 FatherCAD_0_0
#1          NA          NA          NA            NA
#2           7           1           8             1
#3           4           6           1             1
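
As a rough sketch of a vectorised alternative (the comments below point to `rowSums` as the faster route), the same check can be written without looping over rows. The toy data and the helper names `cols`, `has_cad` and `all_na` are made up for illustration, and the fourth row is an added case. It assumes that a row with some answers but no 1 should be coded 0, and that only rows where every column is missing stay `NA`; note that the `apply`/`any` version above returns `NA` for such a partially missing row.

```r
# Toy data mirroring the question, plus one partially missing row (illustrative)
df <- data.frame(
  n_20107_0_0 = c(NA, 7L, 4L, 5L),
  n_20107_0_1 = c(NA, 1L, 6L, NA),
  n_20107_0_2 = c(NA, 8L, 1L, 3L)
)

cols <- paste0("n_20107_0_", 0:2)

# TRUE for rows where at least one of the columns equals 1 (NAs ignored)
has_cad <- rowSums(df[cols] == 1, na.rm = TRUE) > 0

# TRUE for rows where every one of the columns is missing
all_na <- rowSums(!is.na(df[cols])) == 0

# Keep NA for fully missing rows, otherwise 1/0
df$FatherCAD_0_0 <- ifelse(all_na, NA_integer_, as.integer(has_cad))
```

On a data set with half a million rows this should avoid the per-row function calls that `apply` makes, although the actual gain has not been measured here.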

Data:

df <- structure(list(n_20107_0_0 = c(NA, 7L, 4L), n_20107_0_1 = c(NA, 
1L, 6L), n_20107_0_2 = c(NA, 8L, 1L)), .Names = c("n_20107_0_0", 
"n_20107_0_1", "n_20107_0_2"), row.names = c(NA, -3L), class = "data.frame")
Mike H.
  • More standard than ifelse(x, 1, 0) would be as.integer(x), I guess. There's also `rowSums(df==1, na.rm=TRUE) > 0` which should be faster than apply-any. – Frank Apr 28 '17 at 22:02
  • Thanks Frank - the `as.integer` is definitely better than my `ifelse`. It has the added benefit that if you have 2 `NA`s and a `1` it will return 1 (the `ifelse` gave `NA`). Although `rowSums` is definitely the best bet – Mike H. Apr 28 '17 at 22:06
  • Thank you so much @MikeH. and Frank, that worked! Would you mind explaining to me why anything that is not equal to 1 is recoded as 0? For example say I wanted to recode those values not equal to 1 as 2 instead of 0. In the future I will need to recode additional values such as -11, -13 and -23 as NAs. – user7777508 Apr 30 '17 at 04:27
  • It's because we use a logical check, and `as.integer` turns the result into a number (if it's true, the value will be 1; if it's false, the value will be 0). If you want to return a 2 instead of 0 when the check is false, you could go back to using an `ifelse` – Mike H. May 01 '17 at 19:28
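
A minimal sketch of the two follow-up requests in the comments above: turning the special codes -11, -13 and -23 into `NA`, then recoding to 1/2 instead of 1/0 with `ifelse`. The toy data and the names `missing_codes`, `has_cad` and `all_na` are hypothetical, and it assumes the special codes should count as missing before the CAD check is made.

```r
# Toy data with some special codes mixed in (illustrative only)
df <- data.frame(
  n_20107_0_0 = c(NA, 7L, 4L, -11L),
  n_20107_0_1 = c(NA, 1L, 6L, 5L),
  n_20107_0_2 = c(NA, 8L, -13L, 3L)
)

cols <- paste0("n_20107_0_", 0:2)

# Replace the special codes with NA in each column of interest
missing_codes <- c(-11, -13, -23)
df[cols] <- lapply(df[cols], function(x) replace(x, x %in% missing_codes, NA))

# 1 if any column equals 1, otherwise 2; NA only when every column is missing
has_cad <- rowSums(df[cols] == 1, na.rm = TRUE) > 0
all_na  <- rowSums(!is.na(df[cols])) == 0
df$FatherCAD_0_0 <- ifelse(all_na, NA, ifelse(has_cad, 1, 2))
```

The nested `ifelse` is what lets the "false" value be anything you like; swapping the 2 for another value only needs a change in the inner call.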