I made up a test data frame like this:

gene <- as.factor(c('A','B','B','B','C','C','D'))
location <- as.integer(c(1,4,5,6,2,3,9))
df <- data.frame(gene, location)

> df
  gene location
1    A        1
2    B        4
3    B        5
4    B        6
5    C        2
6    C        3
7    D        9

I would like to keep one row for each unique gene (A, B, C, D): the row with the highest location, dropping the other duplicated rows. For example, for gene B only the row with location 6 would be kept, and for gene C only the row with location 3.

So the end result should look like this:

  gene location
1    A        1
4    B        6
6    C        3
7    D        9

Does anyone know how I can do this?

ThomasIsCoding
Helena

2 Answers

You can use aggregate() or ave() to do that, e.g.,

dfout <- aggregate(location ~ gene, df, FUN = max)

or

dfout <- unique(within(df,location <- ave(location,gene,FUN = max)))

such that

> dfout
  gene location
1    A        1
2    B        6
3    C        3
4    D        9
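
A related sketch, in case the original row names (1, 4, 6, 7) and any other columns should be preserved: compare each location against its per-gene maximum from ave() and subset the rows directly. The names `group_max` and `dfout2` are made up for this illustration, and note that ties for the maximum would keep more than one row per gene.

```r
# Rebuild the example data frame from the question.
gene <- as.factor(c('A','B','B','B','C','C','D'))
location <- as.integer(c(1,4,5,6,2,3,9))
df <- data.frame(gene, location)

# ave() returns, for every row, the maximum location within that row's gene.
# 'group_max' is a hypothetical helper name used only in this sketch.
group_max <- ave(df$location, df$gene, FUN = max)

# Keep only the rows whose location equals the maximum of their gene.
dfout2 <- df[df$location == group_max, ]
dfout2
```

Unlike aggregate(), this subsetting approach does not rebuild the data frame, so row names and extra columns carry over unchanged.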
ThomasIsCoding
If your data frame has more columns than just gene and location, you can sort by gene and descending location, then keep the first row for each gene:

df = df[order(df$gene,-df$location),]
df[!duplicated(df$gene),]
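
As a rough illustration of the multi-column case, here is the same idea on the question's data with one extra column. The `score` column and the names `df2`, `sorted`, and `result` are hypothetical, added only for this sketch.

```r
gene <- as.factor(c('A','B','B','B','C','C','D'))
location <- as.integer(c(1,4,5,6,2,3,9))
# 'score' is a made-up extra column, present only to show that it is carried along.
score <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
df2 <- data.frame(gene, location, score)

# Sort by gene, then by location in descending order within each gene.
sorted <- df2[order(df2$gene, -df2$location), ]

# duplicated() flags every occurrence after the first,
# so only the top (highest-location) row of each gene is kept.
result <- sorted[!duplicated(sorted$gene), ]
result
```

Because whole rows are kept, the `score` values stay attached to the row that holds each gene's highest location.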
StupidWolf
  • It seems that this does not work on a larger dataset, but I fixed it by adding one more line, like: `df <- df[order(-df$location),]` `index <- duplicated(df$gene)` `df <- df[!index,]` It looks much the same as what you suggested, though. – Helena Dec 28 '19 at 02:05
  • Yes. Just do `df = df[order(-df$location),]; df <- df[!duplicated(df$gene),]` – StupidWolf Dec 28 '19 at 02:11
  • Your example data frame is small, so I just showed the solution above, as it is not costly to sort on two columns. It would be useful to state the size of your real dataset in your question. – StupidWolf Dec 28 '19 at 02:12