I have two measurements I'm trying to collapse into a single column of data. All subjects gave the first kind of measurement, some gave the second. If the second kind is available, it should be imported into the new column. If the second measurement isn't available for that subject, at least the first one is, so that is the measurement that should be imported for that subject. For instance:

subject  measureA  measureB  collapsed
1        X         A         A
2        Y         B         B
3        Z         -         Z

Again, measureB has preference, but if no measureB is available for a given subject, the collapsed column should use measureA.

Here is the code I've used to try to get this to work:

i = 0
newDV = 0
newID = 0

# data$subject = 1:135, in the data frame with measureA
# conditional$subject = 1, 2, 4, 5, 9, 11, etc., in the data frame with measureB

for (i in data$subject) {
    if (data$subject[i] %in% conditional$subject) {
        newDV[i] <- conditional$measureB[conditional$subject[i]]
        newID[i] = conditional$subject[i]
    } else {
        newDV[i] <- data$measureA[data$subject[i]]
        newID[i] = data$subject[i]
    }
}

I then merged the newID and newDV columns to check that the loop worked correctly. It did not. Here is a snippet from the merged data frame:

newID     newDV
1       1 1.0000000
2       2 0.0000000
3       3 0.7500000
4       5 1.0000000
5       9 0.1666667
6       6 0.3750000
7       7 0.0000000
8       8 0.2500000
9      14 0.2500000

Clearly it is not doing what I hoped; it is skipping several subject numbers. The dataframe gets even stranger further down:

40     40 0.1875000
41     41 0.0625000
42     42 0.2500000
43     75        NA
44     44 0.2500000
45     NA        NA
46     80        NA

It is not progressing sequentially, and it leaves NAs (none of this data should be missing).

Is it clear to anyone why the for loop and the code within it are not doing what I intended?

Thanks in advance.

Adam
  • In SQL, this operation is called a "coalesce", [here's a good question of implementing coalesce in R](http://stackoverflow.com/q/19253820/903061). – Gregor Thomas Jul 01 '15 at 17:58
  • You should also get used to R's vectorization, the best "basic" way to approach this would be `df$collapsed <- ifelse(is.na(df$measureB), df$measureA, df$measureB)`. No looping necessary. – Gregor Thomas Jul 01 '15 at 18:00
  • Are you sure that your loop limit is correctly defined? I suspect that you need to replace `for(i in data$subject)` with `for(i in 1:data$subject)`. – RHertel Jul 01 '15 at 18:02
  • Thank you Gregor, your ifelse solution worked perfectly! – Adam Jul 01 '15 at 18:14

1 Answer

A for loop is more than you need for this operation. Indexing will solve the problem nicely.

eg <- data.frame(one = c(1, 2, 3, 4, 5, 6, NA, 8, 9, 0), 
                 two = c(2, 3, 4, NA, 6, 7, NA, 9, 0, NA))
eg$three <- eg$two
eg$three[is.na(eg$three)] <- eg$one[is.na(eg$three)]
eg

The third line assigns your preferred column to a new column. Then, in the fourth line, every position where the new column is NA is replaced with the value from the first column at that same observation. If both columns are NA for an observation, the result stays NA.
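As a quick check, here is a minimal sketch comparing the index-based fill with the `ifelse` form suggested in the comments. It reuses the `one` and `two` columns from the example; the names `three_idx`, `three_ifelse`, and `missing_two` are only illustrative.

```r
eg <- data.frame(one = c(1, 2, 3, 4, 5, 6, NA, 8, 9, 0),
                 two = c(2, 3, 4, NA, 6, 7, NA, 9, 0, NA))

# Index-based fill: start from the preferred column, then patch the NAs
three_idx <- eg$two
missing_two <- is.na(three_idx)
three_idx[missing_two] <- eg$one[missing_two]

# Vectorized alternative: take `one` wherever `two` is missing
three_ifelse <- ifelse(is.na(eg$two), eg$one, eg$two)

# Both approaches should give the same column
isTRUE(all.equal(three_idx, three_ifelse))
```

Either form should work here; the index-based version only touches the missing positions, while `ifelse` builds the whole column in one call.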

If you need to create a new dataframe, as you do in your example above, use merge (and you can subset the dataframes during the merge to retain only those columns of interest).

eg1 <- data.frame(id_key = 1:10,
                  value = c(2, 4, 6, 8, 10, 12, 14, NA, 18, 20),
                  unimportant = letters[1:10])
eg2 <- data.frame(id_key = 1:10,
                  value = c(3, 6, 9, NA, 15, 18, 21, NA, 27, 30))
eg <- merge(eg1[,c("id_key", "value")], eg2, by = "id_key")
eg$value <- eg$value.y
eg$value[is.na(eg$value)] <- eg$value.x[is.na(eg$value)]
eg
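
In the question, the second data frame only contains some of the subjects, so a plain `merge` would drop everyone without a `measureB` row. A minimal sketch of that case follows, assuming made-up values and the column names from the question (`subject`, `measureA`, `measureB`); `both` and `no_b` are illustrative names.

```r
# Hypothetical data: every subject has measureA, only some have measureB
data <- data.frame(subject = 1:6,
                   measureA = c(0.10, 0.20, 0.30, 0.40, 0.50, 0.60))
conditional <- data.frame(subject = c(1, 2, 4, 5),
                          measureB = c(1.00, 0.00, 0.75, 0.25))

# all.x = TRUE keeps subjects with no measureB row; their measureB becomes NA
both <- merge(data, conditional, by = "subject", all.x = TRUE)

# Prefer measureB, fall back to measureA where measureB is missing
both$collapsed <- both$measureB
no_b <- is.na(both$collapsed)
both$collapsed[no_b] <- both$measureA[no_b]
both
```

Because `merge` matches rows by the value of `subject`, it avoids what appears to go wrong in the original loop, where `conditional$subject[i]` picks the i-th row of `conditional` rather than the row belonging to subject `i`.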
ebyerly