Make a sum of different number of rows related to one column in R

Question

I am new to R and would appreciate some help. I am analyzing RNA-seq data, and for normalization I need to divide the read counts by the sum of the exon lengths (in kilobases) of each gene in my list.

Basically, I have two .csv files:

one with the read counts for the sample I want to analyse;
another with the human GeneID and information about the exon length.

The second one looks like this:

#EnsemblGeneID  #ExonSize
ENSG00000000003 198
ENSG00000000003 188
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 134
ENSG00000000003 83
ENSG00000000003 107
ENSG00000000003 1316
ENSG00000000003 498
ENSG00000000003 188
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 134
ENSG00000000003 27
ENSG00000000003 188
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 46
ENSG00000000003 311
ENSG00000000003 97
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 134
ENSG00000000003 83
ENSG00000000003 107
ENSG00000000003 243
ENSG00000000003 85
ENSG00000000003 48
ENSG00000000003 218
ENSG00000000005 264
ENSG00000000005 131
ENSG00000000005 140
ENSG00000000005 101
ENSG00000000005 153
ENSG00000000005 166
ENSG00000000005 377
ENSG00000000005 411
ENSG00000000005 101
ENSG00000000005 27
ENSG00000000419 187
ENSG00000000419 99
ENSG00000000419 33
ENSG00000000419 76
ENSG00000000419 25
ENSG00000000419 95

What I want is a script that calculates the sum of #ExonSize for each #EnsemblGeneID and writes the results to a new .csv file. As you can see, each gene in my list has a different number of exons, so each GeneID appears on a different number of rows, but what I want to get at the end is something like this:

#EnsemblGeneID  #SumExonSize
ENSG00000000003 5121
ENSG00000000005 1871
ENSG00000000419 515

Any help please?

Thanks in advance

Prem · Accepted Answer · 2018-04-20T12:16:37.010

If I understood your question correctly, you first need to load your data into a data frame:

df <- read.csv("your_path/input.csv", header = TRUE, stringsAsFactors = FALSE)

Then calculate the grouped sum:

library(dplyr)

df1 <- df %>%
  group_by(EnsemblGeneID) %>%
  summarise(SumExonSize = sum(ExonSize))

Finally, write the result to a file using write.csv:

write.csv(df1, "your_path/output.csv", row.names = FALSE)
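
As an aside, the same grouped sum can be sketched in base R with aggregate(), in case dplyr is not installed. This is only an illustration on three rows taken from the sample data in the question; the variable name sums is hypothetical.

```r
# Three rows from the question's sample data, enough to show the grouping
df <- data.frame(
  EnsemblGeneID = c("ENSG00000000419", "ENSG00000000419", "ENSG00000000005"),
  ExonSize = c(187L, 99L, 264L),
  stringsAsFactors = FALSE
)

# Sum ExonSize within each EnsemblGeneID (one output row per gene)
sums <- aggregate(ExonSize ~ EnsemblGeneID, data = df, FUN = sum)

# Rename the summed column to match the desired output header
names(sums)[names(sums) == "ExonSize"] <- "SumExonSize"

print(sums)
```

The result has one row per gene and can be passed to write.csv in the same way as df1.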

Running the dplyr code above on your sample data:

df <- structure(list(EnsemblGeneID = c("ENSG00000000003", "ENSG00000000003", 
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003", 
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003", 
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003", 
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003", 
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003", 
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000003", 
"ENSG00000000003", "ENSG00000000003", "ENSG00000000003", "ENSG00000000005", 
"ENSG00000000005", "ENSG00000000005", "ENSG00000000005", "ENSG00000000005", 
"ENSG00000000005", "ENSG00000000005", "ENSG00000000005", "ENSG00000000005", 
"ENSG00000000005", "ENSG00000000419", "ENSG00000000419", "ENSG00000000419", 
"ENSG00000000419", "ENSG00000000419", "ENSG00000000419"), ExonSize = c(198L, 
188L, 74L, 98L, 134L, 83L, 107L, 1316L, 498L, 188L, 74L, 98L, 
134L, 27L, 188L, 74L, 98L, 46L, 311L, 97L, 74L, 98L, 134L, 83L, 
107L, 243L, 85L, 48L, 218L, 264L, 131L, 140L, 101L, 153L, 166L, 
377L, 411L, 101L, 27L, 187L, 99L, 33L, 76L, 25L, 95L)), .Names = c("EnsemblGeneID", 
"ExonSize"), class = "data.frame", row.names = c(NA, -45L))

Output is:

df1
#  EnsemblGeneID   SumExonSize
#1 ENSG00000000003        5121
#2 ENSG00000000005        1871
#3 ENSG00000000419         515
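
The question's end goal is to divide read counts by the summed exon length in kilobases. A minimal sketch of that step follows, assuming a counts table with columns EnsemblGeneID and ReadCount. The question does not show that file, so the column name ReadCount, the name CountsPerKb, and the count values are all made up for illustration; only the exon sums come from the output above.

```r
# Summed exon sizes per gene, copied from the output above
exon_sums <- data.frame(
  EnsemblGeneID = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),
  SumExonSize = c(5121L, 1871L, 515L),
  stringsAsFactors = FALSE
)

# Hypothetical read counts (illustrative values, not real data)
counts <- data.frame(
  EnsemblGeneID = c("ENSG00000000003", "ENSG00000000419"),
  ReadCount = c(1000L, 50L),
  stringsAsFactors = FALSE
)

# Keep only genes present in both tables, matched on the gene ID
merged <- merge(counts, exon_sums, by = "EnsemblGeneID")

# Convert base pairs to kilobases, then divide counts by that length
merged$CountsPerKb <- merged$ReadCount / (merged$SumExonSize / 1000)

print(merged)
```

merge() drops genes that appear in only one table by default; passing all.x = TRUE would keep every gene from the counts table instead.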

Yes! It was really helpful. I read more about the package and I am using "mutate" rather than "summarize" in order to add the result to the same file. Many thanks again, so kind of you — Cherif, Apr 24 '18 at 13:13
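
For readers wondering what the mutate variant mentioned in this comment might look like, here is a minimal sketch, assuming the same column names as in the answer. It is a guess at the commenter's approach, not their actual code, and it uses three rows of the sample data only; df2 is a hypothetical name.

```r
library(dplyr)

# Three rows from the question's sample data
df <- data.frame(
  EnsemblGeneID = c("ENSG00000000419", "ENSG00000000419", "ENSG00000000005"),
  ExonSize = c(187L, 99L, 264L),
  stringsAsFactors = FALSE
)

# mutate() keeps every exon row and repeats the per-gene sum on each of them,
# whereas summarise() collapses each gene to a single row
df2 <- df %>%
  group_by(EnsemblGeneID) %>%
  mutate(SumExonSize = sum(ExonSize)) %>%
  ungroup()

print(df2)
```

The trade-off is that the output still has one row per exon, so the gene-level sum is duplicated across rows.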
