How to compare duplicated values and filter out unwanted ones in an R data frame?

Question

I made up a test data frame like this:

gene <- as.factor(c('A','B','B','B','C','C','D'))
location <- as.integer(c(1,4,5,6,2,3,9))
df <- data.frame(gene, location)

> df
  gene location
1    A        1
2    B        4
3    B        5
4    B        6
5    C        2
6    C        3
7    D        9

I would like to keep one row per gene (A, B, C, D). For genes that appear more than once, only the row with the highest location should be kept (e.g. for gene B, only the row with location 6; for gene C, only the row with location 3).

So the end result should be like:

  gene location
1    A        1
4    B        6
6    C        3
7    D        9

Does anyone know how I can do this?

ThomasIsCoding · Answer 1 · 2019-12-27T09:34:26.503

4

You can use aggregate() or ave() to do that, i.e.,

dfout <- aggregate(location ~ gene, df, FUN = max)

or

dfout <- unique(within(df,location <- ave(location,gene,FUN = max)))

such that

> dfout
  gene location
1    A        1
2    B        6
3    C        3
4    D        9

edited Dec 27 '19 at 09:34

answered Dec 27 '19 at 09:26

ThomasIsCoding

What should I specify in place of "~ gene" if I have a larger data frame with more than two columns? – Helena Dec 28 '19 at 02:11
@Helena then you can use `dfout <- aggregate(. ~ gene, df, FUN = max)` – ThomasIsCoding Dec 28 '19 at 10:27
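
A note on the `. ~ gene` form, as a minimal sketch: with more than two columns, `aggregate()` applies `max` to each column separately, so a result row can combine values that never appeared together in the original data. The `score` column and the names `df2`, `agg`, `keep`, and `rows` are hypothetical, added only for illustration.

```r
# Test data from the question plus a hypothetical extra column `score`.
df2 <- data.frame(
  gene = c("A", "B", "B", "B", "C", "C", "D"),
  location = c(1L, 4L, 5L, 6L, 2L, 3L, 9L),
  score = c(10, 30, 20, 5, 8, 7, 1)
)

# Column-wise maxima: for gene B this pairs location 6 with score 30,
# although no original row contains both values.
agg <- aggregate(. ~ gene, df2, FUN = max)

# Whole-row selection: keep the rows whose location equals the per-gene maximum.
keep <- df2$location == ave(df2$location, df2$gene, FUN = max)
rows <- df2[keep, ]
```

If several rows share the maximum location for a gene, the `ave()` filter keeps all of them.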

score 3 · Answer 2 · answered Dec 27 '19 at 09:29

If your data frame has more columns than just gene and location, you can try:

df = df[order(df$gene,-df$location),]
df[!duplicated(df$gene),]
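
As a rough sketch of why this works, here are the two steps applied to the test data from the question. The names `sorted` and `result` are introduced here for illustration only.

```r
# Test data from the question.
gene <- as.factor(c("A", "B", "B", "B", "C", "C", "D"))
location <- as.integer(c(1, 4, 5, 6, 2, 3, 9))
df <- data.frame(gene, location)

# Sort by gene, then by location in descending order within each gene.
sorted <- df[order(df$gene, -df$location), ]

# duplicated() flags every occurrence of a gene after the first one, so
# negating it keeps the first (highest-location) row of each gene.
result <- sorted[!duplicated(sorted$gene), ]
```

Because whole rows are selected, any additional columns stay aligned with the chosen location, and the original row names (1, 4, 6, 7) are preserved.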

StupidWolf

It seems this does not work on a larger dataset, but I fixed it by using these lines instead: `df <- df[order(-df$location),]` `index <- duplicated(df$gene)` `df <- df[!index,]` although it looks the same as what you suggest. – Helena Dec 28 '19 at 02:05
Yes. Just do `df = df[order(-df$location),]; df <- df[!duplicated(df$gene),]` – StupidWolf Dec 28 '19 at 02:11
Your example data frame is small, so I just showed the solution above, as it is not costly to sort on two columns. It would be useful to state that in your question. – StupidWolf Dec 28 '19 at 02:12
