1

I have a data frame in R consisting of two columns: 'Genes' and 'Expression'. It has duplicate rows for some of the Genes, however these duplicate entries have differing Expression values. I want to condense the duplicate rows so there is just one row per Gene, and that this row has the largest 'absolute' expression value. See below for example:

For this data frame...

df <- data.frame(Gene=c("AKT","MYC","MYC","RAS","RAS","RAS","TP53"),
                 Expression=c(3,2,6,1,-4,-1,-3))

  Gene Expression
1  AKT          3
2  MYC          2
3  MYC          6
4  RAS          1
5  RAS         -4
6  RAS         -1
7 TP53         -3

I'd like this output..

  Gene Expression
1  AKT          3
2  MYC          6
3  RAS         -4
4 TP53         -3

I can identify the duplicated genes using

duplicated(df$Gene)

But I'm not sure how to exclude those duplicates of lesser absolute value.

Ps - I'm new at this R malarkey..

Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
Alasdair R
  • 13
  • 4
  • possible duplicate of [R, conditionally remove duplicate rows](http://stackoverflow.com/questions/10835284/r-conditionally-remove-duplicate-rows) – Alex A. Mar 13 '15 at 17:16

2 Answers2

1

Looks like a job for aggregate().

# Input data frame
df <- data.frame(Gene=c("AKT", "MYC", "MYC", "RAS", "RAS", "RAS", "TP53"),
                 Expression=c(3, 2, 6, 1, -4, -1, -3))

# Maximum absolute value of Expression by Gene
maxabs <- with(df, aggregate(Expression, list(Gene=Gene), FUN=function(x) max(abs(x))))

# Combine with original data frame
df <- merge(df, maxabs, by="Gene")

# Get desired rows
subset(df, abs(Expression) == x)

# Output:
  Gene Expression
1  AKT          3
3  MYC          6
5  RAS         -4
7 TP53         -3

But then what happens if multiple expression measurements on the same gene have the same value which happens to satisfy the filtering condition? You'll still have duplicate rows, but now it doesn't matter which row we choose per gene. All we have to do is add one more step.

For the sake of example, suppose you have one extra row for MYC with an expression value of 6. Then following the same steps gives us this:

# Example of multiple rows after filtering:
  Gene Expression
1  AKT          3
2  MYC          6
3  MYC          6
4  RAS         -4
5 TP53         -3

# Assign the subset to something
df.maxexpr <- subset(df, abs(Expression) == x)

# Remove duplicate genes (all gene rows should be identical)
df.maxexpr[!duplicated(df.maxexpr$Gene), ]

Then the output in this case matches the expected output as before.

Alex A.
  • 5,466
  • 4
  • 26
  • 56
1

Here is a solution with dplyr:

df <- data.frame(Gene=c("AKT","MYC","MYC","RAS","RAS","RAS","TP53"),
                 Expression=c(3,2,6,1,-4,-1,-3))

library(dplyr)
df %>% 
  group_by(Gene) %>%
  filter(row_number(desc(abs(Expression))) == 1)

Which gives:

#Source: local data frame [4 x 2]
#Groups: Gene
#
#  Gene Expression
#1  AKT          3
#2  MYC          6
#3  RAS         -4
#4 TP53         -3
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77