How to collapse very large sparse dataframes

Question

I want to sum about 10000 columns like colSparseX on 1500 sparse rows of an dataframe. If I have the input:

(I tried on OriginalDataframe this:

coldatfra <- aggregate(. ~colID,datfra,sum)

and this:

coldatfra <- ddply(datfra, .(colID), numcolwise(sum))

But it doesn't work!)

colID <- c(rep(seq(1:6),2), rep(seq(1:2),3))
colSparse1 <- c(rep(1,5), rep(0,4), rep(1,2), rep(0,5), rep(1,2))
cPlSpars2 <- c(rep(1,3), rep(0,6), rep(1,2), rep(0,5), rep(1,2))
coMSparse3 <- c(rep(1,6), rep(0,3), rep(1,2), rep(0,5), rep(1,2))
colSpArseN <- c(rep(1,2), rep(0,7), rep(1,2), rep(0,5), rep(1,2))

(datfra <- data.frame(colID, colSparse1, cPlSpars2, coMSparse3, colSpArseN))

colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
    1          1         1          1          1
    2          1         1          1          1
    3          1         1          1          0
    4          1         0          1          0
    5          1         0          1          0
    6          0         0          1          0
    1          0         0          0          0
    2          0         0          0          0
    3          0         0          0          0
    4          1         1          1          1
    5          1         1          1          1
    6          0         0          0          0
    1          0         0          0          0
    2          0         0          0          0
    1          0         0          0          0
    2          0         0          0          0
    1          1         1          1          1
    2          1         1          1          1

And want to sum the elements for each ID on all (10000 columns - requires some placeholder for colnames are very variable words) colSparses in order to get this:

colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
    1          2         2          2          2
    2          2         2          2          2
    3          1         1          1          0
    4          2         1          2          1
    5          2         1          2          1
    6          0         0          1          0

Note: str(OriginalDataframe)

'data.frame':   1500 obs. of  10000 variables:
 $ someword                                                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ anotherword                                             : num  0 0 0 0 0 0 0 0 0 0 ...

And on a smaller version (which was terminated) of the OriginalDataframe treated with ddply(datfra, .(colID), numcolwise(sum)) I get:

     colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1     0019          0         0          0          0
NA    <NA>         NA        NA         NA         NA
NA.1  <NA>         NA        NA         NA         NA
NA.2  <NA>         NA        NA         NA         NA
NA.3  <NA>         NA        NA         NA         NA

@rawr Hello, again :) Thank you much! I expanded slightly the example, maybe you get the errors now. However, the problem may be related to `ddply` (it may not work so efficiently). Is it possible to avoid `ddply`? — alex, Feb 24 '14 at 02:32
I copied all your code and it runs fine. clear your workspace and run it again. also, `plyr` is for data frames and data frames are not matrices. there are some packages with methods for working with sparse matrices, `Matrix` and `SparseM` I think. I don't use them, so I can't point you to relevant functions, however. — rawr, Feb 24 '14 at 02:42
@rawr Thank you very much! I need to operate on dataframes (changed title). But the Idea is good. I will try a workaround on matrices! — alex, Feb 24 '14 at 02:47
`ddply` is not the solution if you want to work with `data.frames` efficiently. Look at `dplyr` or `data.table` instead. — mnel, Feb 24 '14 at 02:49
@mnel Thank you much for your help. I dont know how to do it. I tried 4h and I never get what I need. Please try to give me some code. Thank you! — alex, Feb 24 '14 at 07:12
Hello Please it is really pressing. I tried `aggregate`, `ddply`, `data.table` id dosent work properly. Some are giving wired errors. Some are transfroming `0`s to `NA`s. I really need a fast answer! Thank you! — alex, Feb 24 '14 at 07:53
your code still works fine for me. post the `str` of your data. Also, restart your r session and run your code again — rawr, Feb 24 '14 at 14:37

score 2 · Accepted Answer · edited May 23 '17 at 11:57

2

Take a look at my answer to this question: Mean per group in a data.frame

Your question is similar. If you change the function being applied from mean to sum, you get what you are looking for.

colstosum <- names(mydt)[2:5]
mydt.sum <- mydt[,lapply(.SD,sum,na.rm=TRUE),by=colID,.SDcols=colstosum]

mydt.sum
   colID colSparse1 cPlSpars2 coMSparse3 colSpArseN
1:     1          2         2          2          2
2:     2          2         2          2          2
3:     3          1         1          1          0
4:     4          2         1          2          1
5:     5          2         1          2          1
6:     6          0         0          1          0

Granted, I can't guarantee the speed or lack thereof of sum on a large data.table. Also, there is a way you should be able to incorporate colSums in the lapply function, but I can't figure out the syntax at the moment.

edited May 23 '17 at 11:57

Community

1
1

answered Feb 25 '14 at 04:15

duHaas

428
2
11

Thank you much for your answer. On my system it dosen't works. In an second attempt I also load `library(data.table)` but nothing changed. For `colstosum <- names(mydt[,2:5,with=F])` I get `Error in '[.data.frame'(mydt, , 2:5, with = F) : unused argument (with = F)` and for the second line I get: `Error in '[.data.frame'(mydt, , lapply(.SD, sum, na.rm = TRUE), by = colID, : unused arguments (by = colID, .SDcols = colstosum)`. – alex Feb 25 '14 at 08:48
Sounds like you didn't convert `mydt` to a data.table before trying to run data.table commands on it. Do `mydt <- data.table(mydt)` first. – fabians Feb 25 '14 at 10:59

How to collapse very large sparse dataframes

1 Answers1