I am new to R and would appreciate some help. I am analyzing RNA-seq data, and for the normalization I need to divide the read counts by the sum of the exon lengths (in kilobases) of each gene in my list.
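To make the normalization concrete, here is a minimal sketch of the calculation I have in mind. The read counts are made-up values for illustration only, the column names are my own choice, and the summed exon lengths are the ones I expect for the three genes in my example further down, assuming the exon sizes are given in bases.

```r
# Hypothetical read counts (invented values, only to illustrate the formula).
counts <- data.frame(
  EnsemblGeneID = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),
  ReadCount     = c(1000, 50, 300)
)

# Summed exon length per gene, assumed to be in bases.
gene_length <- data.frame(
  EnsemblGeneID = c("ENSG00000000003", "ENSG00000000005", "ENSG00000000419"),
  SumExonSize   = c(5121, 1871, 515)
)

# Match counts to lengths by gene ID.
merged <- merge(counts, gene_length, by = "EnsemblGeneID")

# Read count divided by the summed exon length expressed in kilobases.
merged$CountsPerKb <- merged$ReadCount / (merged$SumExonSize / 1000)

print(merged)
```

The part I cannot work out yet is how to obtain the summed exon length per gene from my second file.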
Basically, I have two .csv files:
- one with a list of the read counts for the sample I want to analyse.
- another one in which I have the human gene IDs and information about the exon lengths.
The second one looks like this:
#EnsemblGeneID #ExonSize
ENSG00000000003 198
ENSG00000000003 188
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 134
ENSG00000000003 83
ENSG00000000003 107
ENSG00000000003 1316
ENSG00000000003 498
ENSG00000000003 188
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 134
ENSG00000000003 27
ENSG00000000003 188
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 46
ENSG00000000003 311
ENSG00000000003 97
ENSG00000000003 74
ENSG00000000003 98
ENSG00000000003 134
ENSG00000000003 83
ENSG00000000003 107
ENSG00000000003 243
ENSG00000000003 85
ENSG00000000003 48
ENSG00000000003 218
ENSG00000000005 264
ENSG00000000005 131
ENSG00000000005 140
ENSG00000000005 101
ENSG00000000005 153
ENSG00000000005 166
ENSG00000000005 377
ENSG00000000005 411
ENSG00000000005 101
ENSG00000000005 27
ENSG00000000419 187
ENSG00000000419 99
ENSG00000000419 33
ENSG00000000419 76
ENSG00000000419 25
ENSG00000000419 95
What I want is a script that calculates the sum of #ExonSize for each #EnsemblGeneID and writes the results to a new .csv file.
As you can see, each gene in my list has a different number of exons, so the same gene ID is listed on several rows. What I want to get at the end is something like this:
#EnsemblGeneID #SumExonSize
ENSG00000000003 5121
ENSG00000000005 1871
ENSG00000000419 515
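For reference, here is a rough sketch of the kind of script I imagine, using base R's `aggregate()` on the rows for the last two genes from the excerpt above. The column names and the output file name are my own placeholders, and I am assuming whitespace-separated input with no header line; I am not sure this is the best approach.

```r
# Exon sizes for two of the genes shown above (whitespace-separated, no header).
exon_text <- "
ENSG00000000005 264
ENSG00000000005 131
ENSG00000000005 140
ENSG00000000005 101
ENSG00000000005 153
ENSG00000000005 166
ENSG00000000005 377
ENSG00000000005 411
ENSG00000000005 101
ENSG00000000005 27
ENSG00000000419 187
ENSG00000000419 99
ENSG00000000419 33
ENSG00000000419 76
ENSG00000000419 25
ENSG00000000419 95
"

# Column names are supplied explicitly (my own choice of names).
exons <- read.table(text = exon_text,
                    col.names = c("EnsemblGeneID", "ExonSize"),
                    stringsAsFactors = FALSE)

# Sum ExonSize within each EnsemblGeneID.
gene_length <- aggregate(ExonSize ~ EnsemblGeneID, data = exons, FUN = sum)
names(gene_length)[2] <- "SumExonSize"

# Store the result in a new .csv file (placeholder name in a temp directory).
out_file <- file.path(tempdir(), "sum_exon_size.csv")
write.csv(gene_length, out_file, row.names = FALSE)

print(gene_length)
```

For my real data I would presumably replace the `text =` argument with the path to my file and adjust the separator and header settings to match its actual layout.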
Any help please?
Thanks in advance