1

edited - fixed the typos that Frank and ycw pointed out

I am having trouble identifying pairs of values in a dataframe that exist in another data frame. Because my data is in long format and each column contains the same value many times, I'm getting overly broad results when I try to identify matches. I've tried a number of ways and can't get the exact results I need. Given how simple the operation is, I expect I'm missing something very obvious.

For my example, I have a set of person records in long format, df ppl. I also have another set of person observations for specific time periods, in this example, it's absences on certain days dfabs. I want to mutate ppl to create a new column, that will show when a person-day row matches a person-day observation in df dfabs. I keep getting a problem when I try to mutate however, as it's not clear to me how to subset based on two columns at once.

Example below:

name <- c("Bob", "Bob", "Ana","Ana", "Jorge","Jorge")
day <- c(1,2,1,2,1,2)
eval <- c(4,3,5,5,2,5)
ppl <- as.data.frame(cbind(name,day, eval), names=TRUE)

   name day eval
1   Bob   1    4
2   Bob   2    3
3   Ana   1    5
4   Ana   2    5
5 Jorge   1    2
6 Jorge   2    5

nameabs <- c("Bob","Jorge","Cindy")
dayabs <- c(1,2,1)
dfabs <- as.data.frame(cbind(nameabs, dayabs), names=TRUE)
dfabs

  nameabs dayabs
1     Bob      1
2   Jorge      2
3   Cindy      1

The result I want to see is:

    name day eval absent
1   Bob   1    4      1
2   Bob   2    3      0
3   Ana   1    5      0
4   Ana   2    5      0
5 Jorge   1    2      0
6 Jorge   2    5      1

To do this, I've tried using the mutate function:

mutate(ppl, absent = ifelse(((name %in% dfabs$nameabs) & (day %in% dfabs$dayabs)),1,0))

Result:

   name day eval absent
1   Bob   1    4      1
2   Bob   2    3      1
3   Ana   1    5      0
4   Ana   2    5      0
5 Jorge   1    2      1
6 Jorge   2    5      1

This is not right. It lists Bob and Jorge as absent on both days. It seems that in my ifelse() statement, it evaluated the two logical expressions (name %in% dfabs$nameabs) and (day %in% dfabs$dayabs) on separate rows.

Using other functions such as filter() gives me the same problem. I can't figure out how to make it evaluate the entire expression on one row. In general, with and without dplyr, I've been having a lot of trouble with this general problem: how to subset one df based on a pair's (across two columns) presence in another df? I'm working with a much larger and more complex data set, and having a generalizeable answer would be very helpful.

Thanks in advance for the help!

MrMr
  • 108
  • 6
  • There are errors when I ran your code. Please correct your code. – www Aug 29 '17 at 17:49
  • 1
    By the way, it is not an efficient way to use `cbind` to create a data frame since you are using `data.frame` already. `abs` is a built-in R function. Please use other names for your data frame if possible. – www Aug 29 '17 at 17:50
  • This is just a left join. `left_join(ppl, abs, by = c("Name" = "nameabs", "Day" = "dayabs"))` – Gregor Thomas Aug 29 '17 at 17:52
  • Fyi, you say your data is in wide format, but actually that's called long format. Also, here's the data.table analogue if you're interested: https://stackoverflow.com/q/40314562/ – Frank Aug 29 '17 at 17:53
  • I'm still confused about how to use the left join to get to TRUE/FALSE evaluation I need inside `ifelse()` statement, if my mutate call is going to work. – MrMr Aug 29 '17 at 18:09
  • I don't think it's accurate to mark this as a duplicate of the join question? I think there are a few extra steps here that are not immediately intuitive. – MrMr Aug 29 '17 at 18:11
  • 1
    @MrMr I tend to agree, but it *can* be answered with some such joins. Here's one clumsy way: `dfabs %>% mutate(v = 1) %>% right_join(ppl, by = c("nameabs" = "name", "dayabs" = "day")) %>% mutate(v = as.integer(!is.na(v)))` I guess there is a different dupe that fits better somewhere (similar to the data.table link in my last comment). – Frank Aug 29 '17 at 18:13

0 Answers0