Conditionally removing duplicate rows from a data frame

Question

I have a data frame in R consisting of two columns: 'Genes' and 'Expression'. It has duplicate rows for some of the Genes, however these duplicate entries have differing Expression values. I want to condense the duplicate rows so there is just one row per Gene, and that this row has the largest 'absolute' expression value. See below for example:

For this data frame...

df <- data.frame(Gene=c("AKT","MYC","MYC","RAS","RAS","RAS","TP53"),
                 Expression=c(3,2,6,1,-4,-1,-3))

  Gene Expression
1  AKT          3
2  MYC          2
3  MYC          6
4  RAS          1
5  RAS         -4
6  RAS         -1
7 TP53         -3

I'd like this output..

  Gene Expression
1  AKT          3
2  MYC          6
3  RAS         -4
4 TP53         -3

I can identify the duplicated genes using

duplicated(df$Gene)

But I'm not sure how to exclude those duplicates of lesser absolute value.

Ps - I'm new at this R malarkey..

possible duplicate of [R, conditionally remove duplicate rows](http://stackoverflow.com/questions/10835284/r-conditionally-remove-duplicate-rows) — Alex A., Mar 13 '15 at 17:16

Alex A. · Accepted Answer · 2015-03-15T01:09:44.497

Looks like a job for aggregate().

# Input data frame
df <- data.frame(Gene=c("AKT", "MYC", "MYC", "RAS", "RAS", "RAS", "TP53"),
                 Expression=c(3, 2, 6, 1, -4, -1, -3))

# Maximum absolute value of Expression by Gene
maxabs <- with(df, aggregate(Expression, list(Gene=Gene), FUN=function(x) max(abs(x))))

# Combine with original data frame
df <- merge(df, maxabs, by="Gene")

# Get desired rows
subset(df, abs(Expression) == x)

# Output:
  Gene Expression
1  AKT          3
3  MYC          6
5  RAS         -4
7 TP53         -3

But then what happens if multiple expression measurements on the same gene have the same value which happens to satisfy the filtering condition? You'll still have duplicate rows, but now it doesn't matter which row we choose per gene. All we have to do is add one more step.

For the sake of example, suppose you have one extra row for MYC with an expression value of 6. Then following the same steps gives us this:

# Example of multiple rows after filtering:
  Gene Expression
1  AKT          3
2  MYC          6
3  MYC          6
4  RAS         -4
5 TP53         -3

# Assign the subset to something
df.maxexpr <- subset(df, abs(Expression) == x)

# Remove duplicate genes (all gene rows should be identical)
df.maxexpr[!duplicated(df.maxexpr$Gene), ]

Then the output in this case matches the expected output as before.

Excellent, thanks!! Really appreciate the walkthrough. (btw - missing ',' in the duplicated argument) — Alasdair R, Mar 14 '15 at 12:30
@AlasdairR: Glad it worked for you. Thanks for pointing out the missing comma, fixed. — Alex A., Mar 15 '15 at 01:10

Steven Beaupré · Answer 2 · 2015-03-13T16:58:37.990

1

Here is a solution with dplyr:

df <- data.frame(Gene=c("AKT","MYC","MYC","RAS","RAS","RAS","TP53"),
                 Expression=c(3,2,6,1,-4,-1,-3))

library(dplyr)
df %>% 
  group_by(Gene) %>%
  filter(row_number(desc(abs(Expression))) == 1)

Which gives:

#Source: local data frame [4 x 2]
#Groups: Gene
#
#  Gene Expression
#1  AKT          3
#2  MYC          6
#3  RAS         -4
#4 TP53         -3

edited Mar 13 '15 at 16:58

answered Mar 13 '15 at 15:48

Steven Beaupré

21,343
7
57
77

Conditionally removing duplicate rows from a data frame

2 Answers2