
I have a data frame that contains repeated rows for the same gene. I want to collapse those repeated rows so that each gene appears once and each sample_id column holds the highest count found for that gene. How can I do that?

Sample data (from the comments):

structure(list(gene_id = c("ENSG00000000003", "ENSG00000000003", 
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003", 
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003"
), DO221539 = c(681L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), DO221540 = c(148L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), DO221541 = c(650L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), DO221542 = c(258L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L), DO221543 = c(57L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L), DO221544 = c(224L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L), DO221545 = c(60L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), DO221546 = c(161L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), DO224575 = c(15L, 0L, 0L, 
0L, 0L, 949L, 0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, 
-10L))

I want the output to be:

structure(list(gene_id = "ENSG00000000003", DO221539 = 681L, DO221540 = 148L, DO221541 = 650L, DO221542 = 258L, DO221543 = 57L, DO221544 = 224L, DO221545 = 60L, DO221546 = 161L, DO224575 = 949L), class = "data.frame", row.names = c(NA, -1L))

Anna
  • Try with `which.max` – akrun Oct 30 '18 at 00:29
  • I have 816541 obs. of 325 variables. Each column is a sample_id and each row is a gene_name. For each gene, each sample has its greatest value somewhere among the repeated rows; I need to pick those out. – Anna Oct 30 '18 at 00:37
  • Do you want to select the max value in each column? – akrun Oct 30 '18 at 00:38
  • Yes, I want to remove duplicates and select the highest value for each sample per gene (one row per gene, holding the highest value for each sample). – Anna Oct 30 '18 at 00:40
  • From your comment, I am guessing that you need it per row? It would be better if you showed a small example in your post with the expected output. Try `library(dplyr); df1 %>% group_by(sample_id) %>% summarise_all(max)` – akrun Oct 30 '18 at 00:41
  • As it is a bigger dataset, you can also use `data.table`: `library(data.table); setDT(df1)[, lapply(.SD, max), by = sample_id]` – akrun Oct 30 '18 at 00:43
  • Sorry, I have a .txt file with the head of the data frame. How can I attach a data frame from my computer? – Anna Oct 30 '18 at 00:50
  • Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. One example: `dput(head(x,n=20))`. – r2evans Oct 30 '18 at 00:50
  • @Anna If you have read the data into R, use `dput` to show a small example, i.e. `dput(droplevels(df1[1:4, 1:4]))` – akrun Oct 30 '18 at 00:51
  • Sample data:
    gene_id         DO221539 DO221540 DO221541 DO221542 DO221543 DO221544 DO221545 DO221546 DO224575
    ENSG00000000003      681      148      650      258       57      224       60      161       15
    ENSG00000000003        0        0        0        0        0        0        0        0        0
    ENSG00000000003        0        0        0        0        0        0        0        0        0
    ENSG00000000003        0        0        0        0        0        0        0        0        0
    ENSG00000000003        0        0        0        0        0        0        0        0        0
    ENSG00000000003        0        0        0        0        0        0        0        0      949
    ENSG00000000003        0        0        0        0        0        0        0        0        0
    ENSG00000000003        0        0        0        0        0        0        0        0        0
    ENSG00000000003        0        0        0        0        0        0        0        0        0
    ENSG00000000003        0        0        0        0        0        0        0        0        0
    – Anna Oct 30 '18 at 00:55
  • Not in a comment, please post the output from `dput(...)` into your question. – r2evans Oct 30 '18 at 00:56
  • Anna, I just posted a suggested edit to your question. That is one way to provide sample data for people trying to help. Comments are horrible for significant data and code ... I think I inferred the columns correctly, please correct me if I got it wrong. – r2evans Oct 30 '18 at 01:08
  • I suggest you take the sample data I copied from your comment and please make an R object that is your expected output. That is, with this 10x10 `data.frame`, your expected output could be a 10x1 frame, a vector, a 1x10 frame, or ... something. Using the numbers in that data, construct what you need resulting from this process. – r2evans Oct 30 '18 at 01:10
  • Thanks, it is correct. I just want to know: what is `L`? – Anna Oct 30 '18 at 01:12
  • I added the expected output with the highest value for each sample. – Anna Oct 30 '18 at 01:23
  • Thanks akrun, the code `library(dplyr); df1 %>% group_by(sample_id) %>% summarise_all(max)` worked very well for me. – Anna Oct 30 '18 at 01:32
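
The question about `L` in the comments is not answered in the thread. In R, a trailing `L` on a number makes it an integer literal; without it, the number is stored as a double. `dput` writes the `L` so that integer columns stay integer when the structure is read back in. A small sketch (the variable names `x` and `y` are only for illustration):

```r
x <- 681L   # integer literal, as written by dput() for integer columns
y <- 681    # same value, but stored as a double

class(x)                      # "integer"
class(y)                      # "numeric"
is.integer(y)                 # FALSE
identical(x, as.integer(y))   # TRUE: same value once y is converted
```

For taking a per-group maximum the distinction does not change the result; it only affects the storage type of the columns.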

1 Answer


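We can group by 'gene_id' and take the max of each column within each group with summarise_all:

library(tidyverse)  # loads dplyr, which provides group_by() and summarise_all()
df1 %>% 
   group_by(gene_id) %>%   # one group per gene
   summarise_all(max)      # max of every remaining column within each group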
akrun
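
For readers who want to try the idea without installing any packages, here is a minimal base R sketch of the same per-gene maximum using `aggregate`. It is not part of the original answer. The data frame is a cut-down version of the sample data (three of the sample columns), and the second gene with its counts is made up purely to show that grouping yields one row per gene.

```r
# Reduced sample: three sample columns from the question, plus a second
# gene whose counts are invented for illustration only.
df1 <- data.frame(
  gene_id  = c(rep("ENSG00000000003", 4), rep("ENSG00000000005", 2)),
  DO221539 = c(681L, 0L, 0L, 0L, 3L, 7L),
  DO221540 = c(148L, 0L, 0L, 0L, 9L, 2L),
  DO224575 = c(15L, 0L, 949L, 0L, 0L, 4L)
)

# For every gene_id, take the maximum of each remaining column.
# "." on the left-hand side means "all columns not named on the right".
res <- aggregate(. ~ gene_id, data = df1, FUN = max)
print(res)
```

The result has one row per `gene_id`, with each sample column holding that gene's largest count. Two caveats: the formula interface of `aggregate` drops rows containing `NA` by default, so data with missing counts may need different handling; and in dplyr 1.0.0 and later, `summarise_all(max)` is superseded by `summarise(across(everything(), max))`, which gives the same result after `group_by(gene_id)`.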