
My input looks like the data below. I want to make two new columns: one with the number of times each gene name occurs and another with the sum of its values. Can anyone help?

Input:

gene1   5
gene1   4
gene2   7
gene3   6
gene3   2
gene3   3

Expected output:

gene1    2    9
gene2    1    7
gene3    3    11

Data:

dd <- read.table(header = FALSE, stringsAsFactors = FALSE, text="gene1   5
gene1   4
gene2   7
gene3   6
gene3   2
gene3   3")
rawr
Rashedul Islam
  • Please provide the input in an easily reproducible way, for instance using `dput` in R. Also, what have you tried? – iled Dec 15 '15 at 20:44
  • `aggregate(dd, by = dd['V1'], function(x) if (is.numeric(x)) sum(x) else length(x))` – rawr Dec 15 '15 at 20:55

3 Answers

awk 'BEGIN {print "Gene\tCount\tSum"} {a[$1]+=$2;b[$1]++} END {for (i in a) {print i"\t"b[i]"\t"a[i]}}' file

Gene    Count   Sum
gene1   2   9
gene2   1   7
gene3   3   11
user2138595
  • It becomes quite a bit clearer with meaningful variable names: `awk '{cnt[$1]++; sum[$1]+=$2} END{for (gene in cnt) print gene, cnt[gene], sum[gene]}' file` – Ed Morton Dec 16 '15 at 01:23

This is the sort of thing dplyr is made for. The pipe operator also makes the syntax easy to understand. Replace "col1" and "col2" in the code below with the appropriate column names:

library('dplyr')
df %>% group_by(col1) %>%
    summarise(count=n(),
    sum=sum(col2))
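
As a concrete sketch, here is the same pipeline applied to the `dd` data frame from the question. It assumes the default column names `V1` and `V2` that `read.table` assigns when `header = FALSE`; the result name `res` is only for illustration.

```r
library(dplyr)

# Question's data; read.table names the columns V1 and V2 by default.
dd <- read.table(header = FALSE, stringsAsFactors = FALSE, text = "gene1   5
gene1   4
gene2   7
gene3   6
gene3   2
gene3   3")

# One row per gene: the number of rows and the sum of the values.
res <- dd %>%
  group_by(V1) %>%
  summarise(count = n(), sum = sum(V2))

res
```

Note that `df` in the snippet above has to be a data frame you have already created. If it was never assigned, R finds the built-in function `stats::df` instead and the pipeline fails.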
rawr
mtoto

Please provide actual reusable code. See this question for details.

First, we create the test data:

#libraries
library(stringr);library(plyr)

#test data
df = data.frame(gene = str_c("gene", c(1, 1, 2, rep(3, 3))),
                count = c(5, 4, 7, 6, 2, 3))

Then we summarize with ddply from the plyr package:

#ddply
ddply(df, .(gene), summarize,
      gene_count = length(count),
      sum = sum(count)
)

This takes a data.frame, splits it by the value of the gene column, then summarizes each piece in the two desired ways (row count and sum). See Hadley's introduction to the split, apply and combine approach.

Result:

   gene gene_count sum
1 gene1          2   9
2 gene2          1   7
3 gene3          3  11

There are lots of other ways to do the same.
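
For instance, the split, apply and combine steps can be spelled out by hand in base R. This is only a minimal sketch of the idea, not how ddply is implemented internally; the column names `gene` and `value` and the helper names `pieces`, `rows` and `res` are assumptions made for illustration.

```r
# Same values as in the question, with illustrative column names.
dd <- data.frame(gene  = c("gene1", "gene1", "gene2", "gene3", "gene3", "gene3"),
                 value = c(5, 4, 7, 6, 2, 3),
                 stringsAsFactors = FALSE)

# Split: one data.frame per distinct gene.
pieces <- split(dd, dd$gene)

# Apply: reduce each piece to a single summary row.
rows <- lapply(pieces, function(p) {
  data.frame(gene = p$gene[1],
             gene_count = nrow(p),
             sum = sum(p$value),
             stringsAsFactors = FALSE)
})

# Combine: stack the summary rows back into one data.frame.
res <- do.call(rbind, rows)
rownames(res) <- NULL
res
```

Writing the three steps out like this is more verbose than ddply, but it needs no extra packages and makes each stage easy to inspect.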

Community
CoderGuy123
  • you're only using stringr to do this? `paste0('gene', c(1, 1, 2, rep(3, 3)))` – rawr Dec 15 '15 at 21:05
  • I always use stringr, except for the case where I need to return the matched values because this is not supported for some reason. In that case I fall back to just `grep`. – CoderGuy123 Dec 16 '15 at 21:22