Subset a dataframe in R based on another dataframe, taking multiple hits

Question

I have two dataframes

df1

names   target   start  end

Gene_1  chr5     1      345
Gene_2  chr1     1      678
Gene_3  chr4     1      909
Gene_4  chr48    1      876
Gene_5  chr8     1      432
Gene_6  chr9     1      556
Gene_7  chr12    1      345

df2

gene_names   positions

Gene_1       221
Gene_2       34
Gene_2       444
Gene_2       324
Gene_3       99
Gene_3       232
Gene_4       221
Gene_4       334
Gene_4       390
Gene_6       200
Gene_7       146

df1 is way shorter than df2. The first column of df2 has repeated observations with different values in the second column. Its first column lacks some match (a lot) of the observations in the homologous column in df1.

I wanted to merge them into a df_new, which contains the gene_names and the other column from df2 together with the related information from the other column of df1, even repeated when the observation in gene_names appears 2+ times.

I paved my way with merge

df_new<-merge(df2, df1, by.x = "gene_names", by.y = "names")

and I have a tentatively result of which I am mostly unsure. Someone can shed more light?

Intended output

df_new

gene_names   positions  target   start  end

Gene_1       221        chr5     1      345
Gene_2       34         chr1     1      678
Gene_2       444        chr1     1      678
Gene_2       324        chr1     1      678
Gene_3       99         chr4     1      909
Gene_3       232        chr4     1      909
Gene_4       221        chr48    1      876
Gene_4       334        chr48    1      876
Gene_4       390        chr48    1      876
Gene_6       200        chr9     1      556
Gene_7       146        chr12    1      345

Based on the data shared `df_new<-merge(df2, df1, by.x = "gene_names", by.y = "names")` gives the expected output it seems. What is not working in your original data? — Ronak Shah, Aug 13 '21 at 07:36
I expected it to contain a slightly higher number of observations. Maybe I should cross-check the quality of the data I'm merging... — Alkaligrass, Aug 13 '21 at 10:29
``` df_new<-merge(df1, df2, by.x = "names", by.y = ""gene_names"")``` returns the same, sems to me, am I right? — Alkaligrass, Aug 13 '21 at 10:42

TarJae · Answer 1 · 2021-08-13T07:16:56.097

We could use right_join with df1 and df2:

library(dplyr)
df_result <- df2 %>% 
  right_join(df1, by= c("gene_names"="names"))

output:

   gene_names positions target start   end
   <chr>          <dbl> <chr>  <dbl> <dbl>
 1 Gene_1           221 chr5       1   345
 2 Gene_2            34 chr1       1   678
 3 Gene_2           444 chr1       1   678
 4 Gene_2           324 chr1       1   678
 5 Gene_3            99 chr4       1   909
 6 Gene_3           232 chr4       1   909
 7 Gene_4           221 chr48      1   876
 8 Gene_4           334 chr48      1   876
 9 Gene_4           390 chr48      1   876
10 Gene_6           200 chr9       1   556
11 Gene_7           146 chr12      1   345
12 Gene_5            NA chr8       1   432

data:

df1 <- structure(list(names = c("Gene_1", "Gene_2", "Gene_3", "Gene_4", 
"Gene_5", "Gene_6", "Gene_7"), target = c("chr5", "chr1", "chr4", 
"chr48", "chr8", "chr9", "chr12"), start = c(1, 1, 1, 1, 1, 1, 
1), end = c(345, 678, 909, 876, 432, 556, 345)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -7L))

df2 <- structure(list(gene_names = c("Gene_1", "Gene_2", "Gene_2", "Gene_2", 
"Gene_3", "Gene_3", "Gene_4", "Gene_4", "Gene_4", "Gene_6", "Gene_7"
), positions = c(221, 34, 444, 324, 99, 232, 221, 334, 390, 200, 
146)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -11L))

score 1 · Answer 2 · answered Aug 13 '21 at 16:58

1

We can use merge from base R

merge(df1, df2, all.y = TRUE, by.x = "gene_names", by.y = "names")

answered Aug 13 '21 at 16:58

akrun

874,273
37
540
662

Subset a dataframe in R based on another dataframe, taking multiple hits

2 Answers2