2

Starting with these two data frames:

data <- data.frame("Run_ID" = c(1,2,3), "Sample" = c("A", "B", "C"), "Value" = c(1,2,3))
metadata <- data.frame("Run_ID" = c(1,3), "Sample" = c("A","C"))

I would like to subset data so that it only includes the values from the pairs of Run_ID + Sample which also are present in metadata. The output should contain the same columns as data.

Expected Output:

Run_ID Sample Value
1      A     1
3      C     3

Based on the documentation, it seems like semi_join() should be the solution, but I cannot figure out join based on the two variables.

>semi_join(data, metadata, by = c("Run_ID", "Sample"))
[1] Run_ID Sample Value 
<0 rows> (or 0-length row.names)

Any suggestions are greatly appreciated!

bgfritz1
  • 25
  • 5

2 Answers2

2

Your code is ok but the input metadata is not in a friendly format but I guess this is what you're after:

semi_join(
  data,
  metadata %>% separate_rows(Sample, sep = ','), 
  by = c('Run_ID', 'Sample')
)
#   Run_ID Sample Value
# 1      1      A     1
# 2      3      C     3
s_baldur
  • 29,441
  • 4
  • 36
  • 69
  • Thanks, that was exactly what I was looking for. I realized that there was a typo in the "Sample" metadata column, which I now fixed... Was that what you were referring in terms of formatting? – bgfritz1 Jun 24 '21 at 10:33
0

Does this work:

library(dplyr)
library(tidyr)
metadata %>% separate_rows(Sample) %>% inner_join(data)
Joining, by = c("Run_ID", "Sample")
# A tibble: 2 x 3
  Run_ID Sample Value
   <dbl> <chr>  <dbl>
1      1 A          1
2      3 C          3
Karthik S
  • 11,348
  • 2
  • 11
  • 25
  • Thanks for the suggestion. This works somewhat, but `inner_join()` also adds extra columns to the output if they are present in `metadata`. I would like that the output only returns the columns that were in `data` originally. – bgfritz1 Jun 24 '21 at 09:51
  • @bgfritz1, you can use `select` to select the columns that you need. Just add `%>% select(col1, col2, col3..)` at the end of the above code – Karthik S Jun 24 '21 at 10:00