R - ff package: find the most frequent element in an ffdf and delete the rows where it is located

Question

I need a suggestion on how to find the most frequent element in an ffdf and then delete the rows where it is located. I decided to try the ff package because I'm working with very large data and I run out of memory with base R.

Here is a little example:

 # create a base R Matrix

 > z<-matrix(c("a", "b", "a", "c", "b", "b", "c", "c", "b", "a"),nrow=5,ncol=2,byrow = TRUE)
 > z


     [,1] [,2]
 [1,] "a"  "b" 
 [2,] "a"  "c" 
 [3,] "b"  "b" 
 [4,] "c"  "c" 
 [5,] "b"  "a" 


 # convert z to ffdf

 > u=as.data.frame(z, stringsAsFactors=TRUE)
 > u=as.ffdf(u)
 > u

  ffdf data
   V1 V2
1  a  b
2  a  c
3  b  b
4  c  c
5  b  a

I'm looking for a way to:

Extract the most frequent element in the ffdf (in this case it is "b")
Delete from the ffdf all the rows where "b" occurs

So the new ffdf should look like this:

   V1 V2
1  a  c
2  c  c

In base R I found a way to do this with the "table" function:

  temp <- table(as.vector(z))  
  t1<-names(temp)[temp == max(temp)] 
  z1<- z[rowSums(z== t1[1]) == 0, ]

But since I'm working with huge data, I need a solution based on something like the ff package.

score 1 · Accepted Answer · 2015-06-01T11:27:42.747

require(ff)
z <- matrix(c("a","b","f","c","f","b","e","c","b","e"),nrow=5,ncol=2,byrow = TRUE)
u <- as.data.frame(z, stringsAsFactors=TRUE)
u <- as.ffdf(u)
u

The following should work on a dataset of any size. It uses table.ff and ffwhich from ffbase, ffrowapply from ff, rbind.fill from plyr to combine the per-column counts, and indexing based on ff integer vectors.

require(ffbase)
require(plyr)
## Detect most frequent item (assuming the levels of all columns can be different)
columnfreqs <- lapply(colnames(u), FUN=function(column) table(u[[column]]))
columnfreqs <- lapply(columnfreqs, FUN=function(x) as.data.frame(t(as.matrix(x))))
itemfreqs <- colSums(do.call(rbind.fill, columnfreqs), na.rm=TRUE)
mostfrequent <- names(sort(itemfreqs, decreasing = TRUE))[1]

## Flag the rows of the ffdf in which the most frequent item occurs
idx <- ffrowapply(
  EXPR = apply(u[i1:i2,], MARGIN=1, FUN=function(row) any(row %in% mostfrequent)), 
  X=u, 
  RETURN = TRUE, FF_RETURN = TRUE, RETCOL = NULL, VMODE = "logical")
idx <- ffwhich(idx, idx != TRUE) # keep the positions where the item is absent + convert logicals to integer row indices

## Keep only the remaining rows (those without the most frequent item)
u[idx, ]
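
To check the result on a small sample, the same two steps can be sketched in plain base R. This is only an in-memory illustration, not the ff code itself: a regular data.frame (u_ram, a name I made up) stands in for the ffdf, tapply replaces rbind.fill for merging the per-column counts, and a hand-written loop over chunks with a hypothetical chunk_size mimics what ffrowapply does with i1 and i2.

```r
# In-memory illustration only: a plain data.frame stands in for the ffdf.
z <- matrix(c("a","b","f","c","f","b","e","c","b","e"),
            nrow = 5, ncol = 2, byrow = TRUE)
u_ram <- as.data.frame(z, stringsAsFactors = TRUE)

## Step 1: count per column, then add up the counts that belong to the same item.
## The factor levels differ between columns, so the tables cannot simply be added.
column_counts <- lapply(u_ram, table)
counts <- unlist(lapply(column_counts, as.vector), use.names = FALSE)
items  <- unlist(lapply(column_counts, names), use.names = FALSE)
item_freqs <- tapply(counts, items, sum)

# which.max() returns the first maximum, so a tie goes to the
# alphabetically first item.
most_frequent <- names(item_freqs)[which.max(item_freqs)]

## Step 2: flag the rows that contain the item, one chunk of rows at a time.
chunk_size <- 2L  # hypothetical value, tiny on purpose so several chunks are used
starts <- seq(1L, nrow(u_ram), by = chunk_size)
contains_item <- unlist(lapply(starts, function(i1) {
  i2 <- min(i1 + chunk_size - 1L, nrow(u_ram))
  # as.matrix() turns the factor columns into a character matrix,
  # and drop = FALSE keeps a one-row chunk two-dimensional.
  chunk <- as.matrix(u_ram[i1:i2, , drop = FALSE])
  rowSums(chunk == most_frequent) > 0
}), use.names = FALSE)

## Step 3: keep only the rows where the item is absent.
keep <- which(!contains_item)
result <- u_ram[keep, , drop = FALSE]

print(most_frequent)
print(result)
```

On the example matrix above this picks "b" and keeps rows 2 and 4, which is what the ff version should return as well.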
