0

I have 2 dataframes

df1 #This one is in fact the mapping file (300 rows)
Sample_title        Sample_geo_accession
EC2003090503AA                 GSM118720
EC2003090502AA                 GSM118721
EC2003090504AA                 GSM118722

df2 #(300 rows)
cmap_name   concentration (M)   perturbation_scan_id    vehicle_scan_id3
metformin   0.00001 EC2003090503AA  EC2003090502AA
metformin   0.00001 EC2003090504AA  EC2003090502AA
metformin   0.0000001   EC2003090503AA  EC2003090502AA

I want to read every line in df2 and replace the perturbation_scan_id and vehicle_scan_id3 by the !Sample_geo_accession in df1.

The final output will be:

df3 
cmap_name   concentration_M perturbation_scan_id    vehicle_scan_id3
metformin   0.00001         GSM118720               GSM118721
metformin   0.00001         GSM118722               GSM118721
metformin   0.0000001       GSM118720               GSM118721
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
user3253470
  • 191
  • 1
  • 4
  • 11

1 Answers1

2

One possible solution (if you have many of these columns that you want to replace) is to first create a row index and the "melt" the data by cmap_name, concentration_M and the new index. Then perform a single merge against the new value column (which will have all the values of these column in a single long column), and then "decast" the data back into a wide format. Here's a possible data.table implementation. Please make sure you have the latest version from CRAN. I also made your column names more "R friendly" so it will be easier to work with

library(data.table) # V 1.9.6+
temp <- melt(setDT(df2)[, indx := .I], id = c(1:2, 5))[df1, on = c(value = "Sample_title")]
dcast(temp, cmap_name + concentration_M + indx ~ variable, value.var = "Sample_geo_accession")
#    cmap_name concentration_M indx perturbation_scan_id vehicle_scan_id3
# 1: metformin           1e-07    3            GSM118720        GSM118721
# 2: metformin           1e-05    1            GSM118720        GSM118721
# 3: metformin           1e-05    2            GSM118722        GSM118721

A similar dplyr/tidyr implementation

library(dplyr)
library(tidyr)
df2 %>%
  mutate(indx = row_number()) %>%
  gather(variabe, value, -c(1:2, 5)) %>%
  left_join(., df1, by = c("value" = "Sample_title")) %>%
  select(-value) %>%
  spread(variabe, Sample_geo_accession)
David Arenburg
  • 91,361
  • 17
  • 137
  • 196