
The basic idea is this: I have a large ffdf (about 5.5 million rows x 136 fields). I know for a fact that some of the columns in this data frame are entirely NA. How do I find out which ones they are and remove them appropriately?

My instinct is to do something like (assuming df is the ffdf):

apply(X=is.na(df[,1:136]), MARGIN = 2, FUN = sum)

which should give me a vector of the NA counts for each column, and then I could find which ones have ~5.5 million NA values, remove them using df <- df[,-c(vector of columns)], etc. Pretty straightforward.

However, apply gives me an error.

Error: cannot allocate vector of size 21.6 Mb
In addition: Warning messages:
1: In `[.ff`(p, i2) :
  Reached total allocation of 3889Mb: see help(memory.size)
2: In `[.ff`(p, i2) :
  Reached total allocation of 3889Mb: see help(memory.size)
3: In `[.ff`(p, i2) :
  Reached total allocation of 3889Mb: see help(memory.size)
4: In `[.ff`(p, i2) :
  Reached total allocation of 3889Mb: see help(memory.size)

This tells me that apply can't handle a data frame of this size. Are there any alternatives I can use?

Clarinetist

2 Answers

1

It is easier to use all(is.na(column)). sapply/lapply do not work because an ffdf object is not a list.

You use df[, 1:136] in your code. This causes ff to try to load all 136 columns into memory, which is what causes the memory issues. This does not happen when you index with df[1:136]. The same applies to the final subset: df <- df[,-c(vector of columns)] reads all selected columns into memory.

na_cols <- logical(136)
for (i in seq_len(136)) {
  # df[[i]] opens one column at a time, so only one column is ever in memory
  na_cols[i] <- all(is.na(df[[i]]))
}

res <- df[!na_cols]
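If even a single column is too large to hold in memory comfortably, the same test can be run piecewise with the chunk() helper that ff builds on (it returns RAM-sized index ranges for an ff vector). A sketch of that variation, assuming df is the ffdf from the question:

na_cols <- logical(136)
for (i in seq_len(136)) {
  col <- df[[i]]                  # an ff vector; data stays on disk
  na_cols[i] <- TRUE
  for (idx in chunk(col)) {       # iterate over RAM-sized index ranges
    if (!all(is.na(col[idx]))) {  # only the current chunk is materialised
      na_cols[i] <- FALSE
      break                       # stop at the first non-NA chunk
    }
  }
}
res <- df[!na_cols]

The early break also means a column with non-NA values near the start is rejected after reading only its first chunk.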
Jan van der Laan
  • Similar error when declaring `cols`: `Warning messages: 1: In ff::`[.ff`(x = x, i = i, pack = pack) : Reached total allocation of 3889Mb: see help(memory.size) 2: In ff::`[.ff`(x = x, i = i, pack = pack) : Reached total allocation of 3889Mb: see help(memory.size)` – Clarinetist Dec 01 '15 at 14:51
  • @Clarinetist I saw the discussion in the comments under your question and modified my answer. Your error is caused because your code reads the complete data set into memory. – Jan van der Laan Dec 01 '15 at 14:55
  • `cols` worked, `df[, !cols]` gave me a similar error. – Clarinetist Dec 01 '15 at 15:03
  • I'm not sure if this was the intent, but `cols` only has two values: under `virtual`: `FALSE`, and under `physical`: `FALSE`. – Clarinetist Dec 01 '15 at 15:04
  • @Clarinetist You are right. `sapply` doesn't work. The toy example I used to test the code worked because I accidentally happened to have the same number of columns as returned by `sapply`. My new code should work now. – Jan van der Laan Dec 01 '15 at 15:16
  • Very nice. Thank you! – Clarinetist Dec 01 '15 at 15:27
0

Try this example:

#dummy data
df <- sample(1000000*5)
df <- data.frame( matrix(df,nrow = 1000000))
df$X3 <- NA
df$X6 <- NA

#list of col to remove or keep
colToRemove <- colnames(df)[ colSums(is.na(df[ ,1:6])) == nrow(df) ]
colToKeep <- setdiff(colnames(df), colToRemove)

#subset
res <- df[, colToKeep]

colnames(df)
#[1] "X1" "X2" "X3" "X4" "X5" "X6"
colnames(res)
#[1] "X1" "X2" "X4" "X5"
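A compact variation on the same idea: applying colSums(is.na(df)) to the whole data frame avoids hard-coding the 1:6 column range. Note this is for a regular data.frame and still materialises all the data, so it does not address the ffdf memory problem from the question.

#same dummy data as above
df <- data.frame(matrix(sample(1000000 * 5), nrow = 1000000))
df$X3 <- NA
df$X6 <- NA

#keep only columns whose NA count is less than the row count
res <- df[, colSums(is.na(df)) < nrow(df)]
colnames(res)
#[1] "X1" "X2" "X4" "X5"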
zx8754