
I have a big data frame (2628 x 670316, over 3 GB) and want to use the sum function on every row.

The data file looks something like this, composed of only 0s, 1s and 2s:

0 1 2 0 0 0 0 0 0 1 1 1 ...
0 1 0 0 0 0 2 2 2 2 2 2 ...
.
.
.

When I run `sum(data[1, ] == 0)` it takes ages. Is there a faster way to do it?

Thank you in advance.

PS. The reason I want to use sum is that I want to get the percentage of 0s, 1s and 2s in each row. If there is another way to do that, that answer would also be helpful.

Javier2013
  • For simple row sums you could just use `Reduce("+", data)`; it will be both faster and more memory efficient than `rowSums`, IMO – David Arenburg Nov 25 '14 at 11:21
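A minimal sketch, on toy data, of the Reduce idea from the comment above (assuming data is a data.frame; since a data.frame is a list of columns, adding the columns element-wise yields the row sums without first coercing everything to one big matrix):

set.seed(1)
data <- data.frame(matrix(sample(0:2, 50, TRUE), nrow = 5))
## "+" is folded over the columns, accumulating a single numeric
## vector of row sums; no full-matrix copy of the data.frame is made.
Reduce("+", data)
rowSums(data)   # same values, but coerces the data.frame to a matrix first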

3 Answers


If `df` is your data.frame:

t(apply(df, 1, table)) * 100 / ncol(df)

will give you percentages of 0s, 1s and 2s for each row.

(And you avoid making comparisons, which can take a very long time...)

data:

set.seed(13)
df <- data.frame(matrix(sample(c(0, 1, 2), 500, TRUE), nrow = 10))

`t(apply(df, 1, table)) * 100 / ncol(df)` gives you:

       0  1  2
 [1,] 34 44 22
 [2,] 38 40 22
 [3,] 28 34 38
 [4,] 26 38 36
 [5,] 36 42 22
 [6,] 30 32 38
 [7,] 42 26 32
 [8,] 30 36 34
 [9,] 36 24 40
[10,] 24 34 42

EDIT, thanks to @akrun's comment:

In case not all possible values (0, 1, 2) are represented on every row, you have to do:

t(apply(df, 1, function(x) table(factor(x, levels=0:2))))*100/ncol(df)
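A small illustration on made-up data of why the factor levels matter: when a row lacks one of the values, plain table() returns count vectors of different lengths, apply() falls back to a list, and the multiplication fails with exactly the 'non-numeric argument to binary operator' error reported in the comments below.

## Row 1 has no 2s, so table() alone gives it a length-2 count vector
## while row 2 gets a length-3 one; fixing the levels keeps the three
## columns aligned for every row.
df2 <- rbind(c(0, 1, 0, 1, 0), c(0, 1, 2, 2, 1))
## t(apply(df2, 1, table)) * 100 / ncol(df2)   # errors
t(apply(df2, 1, function(x) table(factor(x, levels = 0:2)))) * 100 / ncol(df2)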
Cath
  • Would be nice if the OP could provide metrics on how much faster this is – ECII Nov 25 '14 at 11:03
  • @akrun: 10 sec for 100x50000 – Cath Nov 25 '14 at 11:06
  • `system.time(rowSums(!data)) # user system elapsed 0.116 0.007 0.123` on a 100X50000 dataset – akrun Nov 25 '14 at 11:10
  • I tried it but got a 'non-numeric argument to binary operator' error. Is this a problem with my data frame? – Javier2013 Nov 25 '14 at 11:21
  • @Javier2013 Can you give us a snippet of your data, or a `str()`? – ECII Nov 25 '14 at 11:23
  • This is for a smaller data set that is exactly the same as the bigger file: Classes ‘data.table’ and 'data.frame': 820 obs. of 9852 variables: $ V1 : int 1 1 1 1 0 0 0 0 0 0 ... All variables are integers, either 0s, 1s, or 2s, although not every row has all numbers. – Javier2013 Nov 25 '14 at 11:27
  • Tried running `apply(df,1,table)` and it worked! Looks like there is a problem when trying to get percentages from the table. – Javier2013 Nov 25 '14 at 11:37
  • @Javier2013, it is weird that you can't get the percentages. Can you show what the output of `apply(df,1,table)` looks like? – Cath Nov 25 '14 at 11:57
  • @Javier2013 Try `t(apply(df, 1, function(x) table(factor(x, levels=0:2))))*100/ncol(df)` I was able to get the error using `set.seed(24); df <- as.data.frame(matrix(sample(0:2, 10*5, replace=TRUE), ncol=5))` – akrun Nov 25 '14 at 12:18
  • Thanks @akrun, I edited my answer to take into account data.frames with rows lacking one or more of the possible values – Cath Nov 25 '14 at 12:26
  • @Javier2013 Testing it on a `100x50000` dataset, I got `#user system elapsed # 2.364 0.000 2.362` – akrun Nov 25 '14 at 12:26
  • @Javier2013, you're welcome, I'm happy if I helped you solve your problem :-) – Cath Nov 25 '14 at 13:28

If the data are all integers, then it's much faster to represent them as a matrix m (this is also semantically closer to what the data actually are -- a rectangular collection of values of homogeneous type, rather than columns of possibly different types), perhaps input using scan(). With a matrix, column operations are faster than row operations, so transpose it with t(m). The tabulate() function is much faster than table(), though a little more finicky in the present case:

## per-row counts of the values 1..max(m); tabulate() ignores zeros
nonZeroCounts <- apply(t(m), 2, tabulate, max(m))

In more detail, here are the proposed solutions:

f0 <- function(df)
    t(apply(df, 1, table))

f1 <- function(m) {
    ## per-row counts of 1..max(m); tabulate() does not count zeros
    n <- t(apply(t(m), 2, tabulate, max(m)))
    ## the zeros are whatever is left over in each row
    ans <- cbind(ncol(m) - as.integer(rowSums(n)), n)
    colnames(ans) <- 0:max(m)
    ans
}

some data:

nrow <- 100; ncol <- floor(nrow * 670316 / 2628)
m <- matrix(sample(0:2, nrow * ncol, TRUE), nrow=nrow)
df <- as.data.frame(m)

and a basic comparison:

> system.time(ans0 <- f0(df))
   user  system elapsed 
  1.082   0.000   1.083 
> system.time(ans1 <- f1(m))
   user  system elapsed 
  0.052   0.000   0.052 
> identical(ans0, ans1)
[1] TRUE

or with `nrow = 1000`:

> system.time(ans1 <- f1(m))
   user  system elapsed 
  6.521   1.461   7.984 
> system.time(ans0 <- f0(df))   ## argh, boring, stop after 1.5 minutes!
  C-c C-c
Timing stopped at: 93.608 2.752 96.325 
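For completeness, a hedged sketch of one more variant (the name f2 and the code are illustrative additions, not benchmarked above): tabulate() can count every row in a single pass if each (row, value) pair is encoded as one bin index.

## Hypothetical f2: value v in row r goes into bin 3 * (r - 1) + v + 1,
## so bins 1..3 hold row 1's counts of 0, 1 and 2, bins 4..6 row 2's, etc.
## Assumes, as in the question, that the values are exactly 0, 1 and 2.
f2 <- function(m) {
    counts <- tabulate(3L * (row(m) - 1L) + m + 1L, nbins = 3L * nrow(m))
    ans <- matrix(counts, ncol = 3, byrow = TRUE)
    colnames(ans) <- 0:2
    ans
}

The price is the temporary row(m) index matrix, which is the same size as m.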
Martin Morgan

Try `rowSums`, perhaps it's faster:

test <- data.frame(V1 = c(1, 1, 1, 1), V2 = c(2, 2, 2, 0))
rowSums(test)

I doubt, however, that you can get a faster sum function than the vanilla `sum`.

Another way to get the sums is the notorious `apply` function family:

apply(test, 1, sum)

I did some tests and `rowSums` is quite fast:

set.seed(13)
df <- data.frame(matrix(sample(c(0, 1, 2), 500000000, TRUE), nrow = 2000))

system.time(rowSums(df))
   user  system elapsed 
   8.00    0.68    8.69

Whereas for `apply`:

system.time(apply(df, 1, sum))
   user  system elapsed 
  81.67    5.99   87.96 
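Since the PS of the question asks for the percentage of each value per row rather than the plain sum, here is a minimal sketch building on the same machinery (an illustrative assumption-laden addition, not benchmarked: it takes the question at its word that only 0, 1 and 2 occur, and each `df == k` comparison allocates a logical matrix the size of `df`):

## rowMeans of the logical matrix df == k is the fraction of entries
## equal to k in each row; multiplying by 100 gives the percentage.
pct <- sapply(0:2, function(k) rowMeans(df == k) * 100)
colnames(pct) <- 0:2   # one column of percentages per value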
ECII
  • Doesn't rowSums create a matrix, that is, first copy the 3 GB of data? – Javier2013 Nov 25 '14 at 11:02
  • Probably. The OP was unclear about his final intention until the last edit of the question – ECII Nov 25 '14 at 11:04
  • @Javier2013 It does not seem that rowSums does any copying, since data creation takes more time than rowSums. I was surprised too. Perhaps a more knowledgeable member can give us some feedback. – ECII Nov 25 '14 at 11:22
  • @ECII It would be better to also include the comparisons between `t(apply(df,1,table))*100/ncol(df)` and the equivalent using `rowSums` – akrun Nov 25 '14 at 11:44
  • @Javier2013 `apply` first coerces a `data.frame` to a matrix too! – Simon O'Hanlon Nov 25 '14 at 13:37