I have a gene expression dataset "rna" with probe IDs, and a reference dataset "ref" that maps probe IDs to their corresponding Entrez IDs. I want to match the probe IDs in "rna" against those in "ref" so that I can add the Entrez ID to "rna". In my reference dataset, some probes map to multiple Entrez IDs, so I also need to duplicate those rows in "rna" so that each row maps to only one Entrez ID (keeping the rest of the row unchanged). The outcome I am looking for is "org_rna". Some Entrez IDs are shared by more than one probe; those duplicates can be left as they are. TYIA

rna = data.frame("Org1" = c(1.5, 3.5, 2.4, 3.2, 4.5),
                 "Org2" = c(2.5, 3.5, 7, 2.6, 7),
                 "Org3" = c(3.6, 7.2, 4, 5, 6),
                 "Probe" = c("11715100_at", "11715101_s_at", "11715102_x_at",
                             "11715103_x_at", "11715104_s_at"))

ref = data.frame("Probe Set ID" = c("11715100_at", "11715101_s_at", "11715102_x_at",
                                    "11715103_x_at", "11715104_s_at"),
                 "Entrez" = c("8355", "8355", "340307 /// 441294",
                              "285501", "8263 /// 474383 /// 474384"))

# Desired result: rows 3 and 5 of rna are repeated once per Entrez ID
org_rna = data.frame("Org1" = c(1.5, 3.5, 2.4, 2.4, 3.2, 4.5, 4.5, 4.5),
                     "Org2" = c(2.5, 3.5, 7, 7, 2.6, 7, 7, 7),
                     "Org3" = c(3.6, 7.2, 4, 4, 5, 6, 6, 6),
                     "Probe" = c("11715100_at", "11715101_s_at",
                                 "11715102_x_at", "11715102_x_at", "11715103_x_at",
                                 "11715104_s_at", "11715104_s_at", "11715104_s_at"),
                     "Entrez" = c("8355", "8355", "340307", "441294",
                                  "285501", "8263", "474383", "474384"))
jimuq233

1 Answer

step 1

Convert ref to a data frame in which each Entrez ID has its own row:

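ref_match <- ref
# Split each "id /// id" string into a list of IDs (as.character() guards against factor columns in R < 4.0)
ref_match$Entrez <- strsplit(as.character(ref_match$Entrez), ' /// ', fixed = TRUE)
# Expand the list column so that each Entrez ID gets its own row
ref_match <- tidyr::unnest(ref_match, Entrez)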

See "R - make multiple elements list to data frame rows" for alternative methods.
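One such alternative, as a minimal sketch that uses base R only (no tidyr). The name ref_long is made up for this illustration; the data is the ref example from the question.

```r
# Reference table from the question; data.frame() turns the column name
# "Probe Set ID" into Probe.Set.ID because check.names = TRUE by default.
ref <- data.frame(
  "Probe Set ID" = c("11715100_at", "11715101_s_at", "11715102_x_at",
                     "11715103_x_at", "11715104_s_at"),
  "Entrez" = c("8355", "8355", "340307 /// 441294",
               "285501", "8263 /// 474383 /// 474384"),
  stringsAsFactors = FALSE
)

# Split every Entrez string on the literal separator " /// ".
ids <- strsplit(ref$Entrez, " /// ", fixed = TRUE)

# Repeat each probe ID once per Entrez ID and flatten the list of IDs.
# ref_long is a hypothetical name used only in this sketch.
ref_long <- data.frame(
  Probe.Set.ID = rep(ref$Probe.Set.ID, times = lengths(ids)),
  Entrez = unlist(ids),
  stringsAsFactors = FALSE
)

print(ref_long)
```

This gives one row per probe/Entrez ID pair, in the same order as ref, so it can be used in place of ref_match in the next step.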

step 2

Join rna and ref_match where rna$Probe == ref_match$Probe.Set.ID (data.frame() converted the column name "Probe Set ID" to Probe.Set.ID):

rna_matched <- dplyr::left_join(rna, ref_match, by=c('Probe' = 'Probe.Set.ID'))
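If you would rather avoid dplyr, the same join can probably be done with base R's merge(). The sketch below is an assumption-laden illustration, not part of the original answer: ref_long and rna_merged are made-up names, and ref_long is typed out by hand to stand in for the one-row-per-Entrez-ID table from step 1. Note that merge() moves the key column to the front and sorts rows by key, so the column order differs from the left_join() result.

```r
# Expression data from the question.
rna <- data.frame(
  Org1 = c(1.5, 3.5, 2.4, 3.2, 4.5),
  Org2 = c(2.5, 3.5, 7, 2.6, 7),
  Org3 = c(3.6, 7.2, 4, 5, 6),
  Probe = c("11715100_at", "11715101_s_at", "11715102_x_at",
            "11715103_x_at", "11715104_s_at"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the step 1 result: one row per probe/Entrez pair.
ref_long <- data.frame(
  Probe.Set.ID = c("11715100_at", "11715101_s_at",
                   "11715102_x_at", "11715102_x_at", "11715103_x_at",
                   "11715104_s_at", "11715104_s_at", "11715104_s_at"),
  Entrez = c("8355", "8355", "340307", "441294",
             "285501", "8263", "474383", "474384"),
  stringsAsFactors = FALSE
)

# Left join: keep every row of rna (all.x = TRUE) and repeat it once for
# each matching row in ref_long.
rna_merged <- merge(rna, ref_long,
                    by.x = "Probe", by.y = "Probe.Set.ID",
                    all.x = TRUE)

print(rna_merged)
```

Probes with no match in the reference would be kept with NA in the Entrez column, which mirrors the behaviour of a left join.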

cleanup

rm(ref_match)

These examples use the packages tidyr and dplyr, both part of the tidyverse; you might need to install them first. The result in rna_matched follows the layout of your org_rna example: one row per probe/Entrez ID pair, with the expression values repeated for probes that map to several Entrez IDs.

Caspar V.