1

I have a key and a massive metadata table. There's a column in the metadata table which contains values such as the following:

body_site
Lung
Lung
Brain - Amygdala
Brain - Amygdala
Brain - Caudate (basal ganglia)
Brain - Caudate (basal ganglia)
Lung
Lung
Skin - Sun Exposed (Lower leg)
Skin - Sun Exposed (Lower leg)
Brain - Spinal cord (cervical c-1)
Brain - Spinal cord (cervical c-1)

with body_site as a header. The key looks like this:

Tissue,Key
Adipose - Subcutaneous,ADPSBQ
Adipose - Visceral (Omentum),ADPVSC
Adrenal Gland,ADRNLG
Artery - Aorta,ARTAORT
Artery - Coronary,ARTACRN
Artery - Tibial,ARTTBL
Bladder,BLDDER
Brain - Amygdala,BRNAMY
Brain - Anterior cingulate cortex (BA24),BRNACC

It's a csv of the corresponding abbreviation for each type of tissue. What I want to do is replace all entries in the first table's column with the corresponding abbreviations in the second table's second column.

The problem is, when I take the advice of the highly-popular post which demonstrates how to do this, I somehow end up with a table that only has values for the body_site column; in other words, all other data in that table is deleted except for the data that was replaced. On the plus side, the replacement works, but now I have an otherwise completely empty table, save for headers.

Here's what my code looks like. I included both solutions offered by the top answerer, both of which I tried.

library("data.table")
args = commandArgs(trailingOnly=TRUE)
# SraRunTable.txt is args[1]
#sratabl <- fread(args[1])
sratabl <- fread("SraRunTable.txt")
tiskey <- fread("GTExTissueKey.csv")
# current directory is args [2]
new <- sratabl  # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(sratabl, function(x) tiskey$Key[match(x, tiskey$Tissue)])

new <- sratabl
new[] <- tiskey$Key[match(unlist(sratabl), tiskey$Tissue)]
CelineDion
  • 906
  • 5
  • 21

2 Answers2

1

Here is a solution:

require(data.table)
df1 <- data.frame(a = c("a","b","c"), b = c("x","y","z"))
df2 <- data.frame(a = c("a","c"), b = c("new_x","new_z"))
setDT(df1)
setDT(df2)

# inspect each df
df1
#    a b
# 1: a x
# 2: b y
# 3: c z
df2
#    a     b
# 1: a new_x
# 2: c new_z

l <- match(df1$a, df2$a, nomatch = 0)
df1$b[l != 0] <- df2$b[l]

df1
#    a     b
# 1: a new_x
# 2: b     y
# 3: c new_z
Khaynes
  • 1,976
  • 2
  • 15
  • 27
1
  1. I think you are over-using lapply; since you're working on a single column in a frame, there's no need to use that here.
  2. There are going to be NAs in the results, at least with this data (and you should likely guard against it anyway). Because of this, I suggest an intermediate/temp variable.

For #2 above, I'll keep the variable within the frame (and then remove it) for ease of association, though there's no need for that, it could just as easily be stored in a standalone vector and then assigned after fixing.

df1$tmp <- df2$Key[ match(df1$body_site, df2$Tissue) ]
head(df1)
#                         body_site    tmp
# 1                            Lung   <NA>
# 2                            Lung   <NA>
# 3                Brain - Amygdala BRNAMY
# 4                Brain - Amygdala BRNAMY
# 5 Brain - Caudate (basal ganglia)   <NA>
# 6 Brain - Caudate (basal ganglia)   <NA>

Those are the NAs you need to be wary of ... this next part uses the new column only if not NA.

df1$tmp <- ifelse(is.na(df1$tmp), df1$body_site, df1$tmp)
head(df1)
#                         body_site                             tmp
# 1                            Lung                            Lung
# 2                            Lung                            Lung
# 3                Brain - Amygdala                          BRNAMY
# 4                Brain - Amygdala                          BRNAMY
# 5 Brain - Caudate (basal ganglia) Brain - Caudate (basal ganglia)
# 6 Brain - Caudate (basal ganglia) Brain - Caudate (basal ganglia)

Now, cleanup:

df1$body_site <- df1$tmp
df1$tmp <- NULL

Alternative: joins.

library(dplyr)
left_join(df1, df2, by=c("body_site" = "Tissue")) %>% head()
#                         body_site    Key
# 1                            Lung   <NA>
# 2                            Lung   <NA>
# 3                Brain - Amygdala BRNAMY
# 4                Brain - Amygdala BRNAMY
# 5 Brain - Caudate (basal ganglia)   <NA>
# 6 Brain - Caudate (basal ganglia)   <NA>

(same cleanup required)

library(data.table)
head( merge(df1, df2, by.x="body_site", by.y="Tissue", all.x=TRUE) )
#                             body_site    Key
# 1:                   Brain - Amygdala BRNAMY
# 2:                   Brain - Amygdala BRNAMY
# 3:    Brain - Caudate (basal ganglia)   <NA>
# 4:    Brain - Caudate (basal ganglia)   <NA>
# 5: Brain - Spinal cord (cervical c-1)   <NA>
# 6: Brain - Spinal cord (cervical c-1)   <NA>

(same cleanup required)


Data:

df1 <- read.csv(header=T, stringsAsFactors=F, text='
body_site
Lung
Lung
Brain - Amygdala
Brain - Amygdala
Brain - Caudate (basal ganglia)
Brain - Caudate (basal ganglia)
Lung
Lung
Skin - Sun Exposed (Lower leg)
Skin - Sun Exposed (Lower leg)
Brain - Spinal cord (cervical c-1)
Brain - Spinal cord (cervical c-1)')

df2 <- read.csv(header=T, stringsAsFactors=F, text='
Tissue,Key
Adipose - Subcutaneous,ADPSBQ
Adipose - Visceral (Omentum),ADPVSC
Adrenal Gland,ADRNLG
Artery - Aorta,ARTAORT
Artery - Coronary,ARTACRN
Artery - Tibial,ARTTBL
Bladder,BLDDER
Brain - Amygdala,BRNAMY
Brain - Anterior cingulate cortex (BA24),BRNACC')
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1
    Incredibly thorough with alternate solutions and a full explanation. Thank you, it worked! – CelineDion Jan 22 '19 at 22:05
  • I tend to prefer the "join" alternatives in my code, though it breaks down if you need partial matches (e.g., "fuzzy joins"). One advantage of the "join" method is that it can be done in-database if your data is originating in a DBMS somewhere, negating the need to pull unnecessary info into R. – r2evans Jan 22 '19 at 22:08
  • Finally, the rationale for `lapply` is if you are doing the same thing to multiple columns, which was the case in the other question to which you linked. In that case, realize that `lapply` is doing one function for each column, not for each element (cell) of a column; the function itself is working on a whole column at a time, aka "whole vector of strings". – r2evans Jan 22 '19 at 22:10