0

I have a spreadsheet that have >100 columns, and many columns have the same names. I would like to merge those columns with the same names and row-sum the values in those columns. I think conditional execution, if(), should do it, but I'm stuck at writing the condition for identical column names? And what will be the function to merge and sum the columns? merge()? or rowsum()?

aa <- read.table()

if (colnames(aa) == ) merge/rowsum()

Thanks.

This is a sample of what it looks like now:

B C U B C
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4

And this is what I hope to get: Reduction of the number of columns and sum the values when merging.

B C U
2 2 1
4 4 2
6 6 3
8 8 4
Community
  • 1
  • 1

3 Answers3

3

Solution 1

Using split(), lapply(), rowSums(), and do.call()/cbind():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Solution 2

Replacing the rowSums() call with Reduce()/`+`():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Solution 3

Replacing the index vector middleman with splitting the data.frame (as an unclassed list) directly:

do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
##      B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Benchmarking

library(microbenchmark);

bgoldst1 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
bgoldst2 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
bgoldst3 <- function(df) do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
sotos <- function(df) sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]));

df <- data.frame(B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),U=c(1L,2L,3L,4L),B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),check.names=F);

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: microseconds
##          expr     min       lq     mean   median      uq      max neval
##  bgoldst1(df) 245.473 258.3030 278.9499 272.4155 286.742  641.052   100
##  bgoldst2(df) 156.949 166.3580 184.2206 171.7030 181.539 1042.618   100
##  bgoldst3(df)  82.110  92.5875 100.9138  97.2915 107.128  170.207   100
##     sotos(df) 200.997 211.9030 226.7977 223.6630 235.210  328.010   100

set.seed(1L);
NR <- 1e3L; NC <- 1e3L;
df <- setNames(nm=LETTERS[sample(seq_along(LETTERS),NC,T)],data.frame(replicate(NC,sample(seq_len(NR*3L),NR,T))));

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: milliseconds
##          expr       min        lq      mean    median        uq      max neval
##  bgoldst1(df) 11.070218 11.586182 12.745706 12.870209 13.234997 16.15929   100
##  bgoldst2(df)  4.534402  4.680446  6.161428  6.097900  6.425697 44.83254   100
##  bgoldst3(df)  3.430203  3.555505  5.355128  4.919931  5.219930 41.79279   100
##     sotos(df) 19.953848 21.419628 22.713282 21.829533 22.280279 60.86525   100
bgoldst
  • 34,190
  • 6
  • 38
  • 64
  • No problem at all :) – bgoldst May 19 '16 at 08:13
  • 1
    Thanks, bgoldst! Solution 1 and 2 gave this error: Error in df[x] : object of type 'closure' is not subsettable. But Solution 3 worked. Good enough for me! – Soon Hwee N May 19 '16 at 11:52
  • @SoonHweeN Thanks! The error message you got must have been caused by `df` not being assigned to any value. There is a built-in function `df()` in the `stats` package which is in the search path, and so if you haven't assigned `df` yourself, then any reference to `df` in code will bind to the function. All 3 solutions should work if `df` has been assigned to the input data.frame. – bgoldst May 19 '16 at 12:03
2

One way to do it,

sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]))

#     B C U
#[1,] 2 2 1
#[2,] 4 4 2
#[3,] 6 6 3
#[4,] 8 8 4
Sotos
  • 51,121
  • 6
  • 32
  • 66
1

Here is another option with melt/dcast from data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)), create a row number column ('rn'), melt from 'wide' to 'long' format and then dcast it to 'wide' by specifying the fun.aggregate as sum.

library(data.table)
setDT(df1)[, rn :=  1:.N]
dcast(melt(df1, id.var= "rn"), rn ~variable, value.var="value", sum)[, rn:= NULL][]
#   B C U
#1: 2 2 1
#2: 4 4 2
#3: 6 6 3
#4: 8 8 4
akrun
  • 874,273
  • 37
  • 540
  • 662