I'm trying to replace the NAs in a matrix, `mat`, with zeros, using `mat[is.na(mat)] <- 0`. This works well on a matrix of 94531 observations of 18946 variables or smaller, but when I try it on a matrix of 112039 observations of 22752 variables, R shows an error:

Error in if (!nreplace) return(x) : missing value where TRUE/FALSE needed
In addition: Warning message:
In sum(i, na.rm = TRUE) : integer overflow - use sum(as.numeric(.))

I don't know what I'm doing wrong and I don't understand the error.

Here is an example of the structure of my data.

Small data.matrix (made from the real data source):

> str(mat)
Classes 'data.table' and 'data.frame':  94531 obs. of  18946 variables:
 $ 6316506: num  1 0 NA NA NA NA NA NA NA NA ...
 $ 6794602: num  0 1 NA NA NA NA NA 0 0 0 ...
 $ 1008667: num  NA NA 0 1 0 NA NA 0 0 0 ...
 $ 6312454: num  NA NA 1 0 0 NA NA 0 0 0 ...
 $ 8009082: num  NA NA 0 0 1 NA NA NA NA NA ...
 $ 1023293: num  NA NA NA NA NA 1 NA NA NA NA ...
 $ 6740421: num  NA NA NA NA NA 1 NA 0 0 0 ...
 $ 6777805: num  NA NA NA NA NA NA 1 NA NA NA ...
 $ 1000558: num  NA NA NA NA NA NA NA 0 0 0 ...
 $ 1001682: num  NA NA NA NA NA NA NA 0 0 0 ...

The bigger one looks exactly the same.

Another question:

Is there a way to use `rbindlist(data, fill=TRUE)` and fill with zeros instead of NAs?

  • Can you make a reproducible example? – Roman Luštrik Aug 23 '17 at 12:31
  • Try `str(mat1)` and `str(mat2)`, where mat1 is your first matrix, which works, and mat2 is the second one. I suspect that some value isn't allowed somewhere in the second, larger matrix. – Adamm Aug 23 '17 at 12:59
  • @RomanLuštrik I'm not sure what you mean by "reproducible example" – Martina Zapletalová Aug 23 '17 at 13:00
  • @MartinaZapletalová Have a look [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Steven Beaupré Aug 23 '17 at 13:02
  • @StevenBeaupré I found it before you posted it here, but I'm still not sure what I should do. My dataset is really big (millions of elements), and if I use something like `dput(head(mat,1))` the output has many rows. So I'm not sure if it helps. – Martina Zapletalová Aug 23 '17 at 13:08
  • Maybe related to [this](https://stackoverflow.com/questions/8804779/what-is-integer-overflow-in-r-and-how-can-it-happen) ? – Steven Beaupré Aug 23 '17 at 13:31
  • You can always simulate enough data. Something along the lines of `matrix(rnorm(94531*18946), nrow = 94531)`. – Roman Luštrik Aug 23 '17 at 13:40
  • @RomanLuštrik that's right but it takes a lot of time (maybe just on my computer) – Martina Zapletalová Aug 23 '17 at 14:01
  • I added another question (rbindlist), which would also solve my problem, because I made `mat` with `rbindlist`: `mat <- (rbindlist(data.inv, fill=T))` `mat[is.na(mat)] <- 0` `mat <- data.matrix(mat)` – Martina Zapletalová Aug 23 '17 at 14:08
  • You are working with a `data.table` NOT a matrix. These are different objects and it is important to note the distinction as different solutions/efficiencies arise depending on the object type. – lmo Aug 23 '17 at 14:10
  • Here, for this data.table, `dt[, names(dt) := lapply(.SD, function(x) {x[is.na(x)] <- 0; x})]` will return the desired values, but I'm not sure if it will run out of memory in the process. My intuition is that the `lapply` is making a copy of the data, which may cause you to run out of memory, but it's worth a try. – lmo Aug 23 '17 at 14:20

1 Answer

With a large data.table, the set function is usually the way to go for replacement within variables.
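
For context, the error in the question most likely comes from integer overflow rather than from anything wrong in the data. The `if (!nreplace) return(x)` in the message appears to originate in the data.frame assignment method, which counts the cells to replace with `sum()` on a logical matrix; that count is an integer and cannot exceed `.Machine$integer.max`. The sketch below only checks the arithmetic with the dimensions given in the question and reproduces the two messages on a tiny input. It assumes, based on the `str` output, that most cells of the larger table are NA.

```r
# Cell counts of the two tables from the question (doubles, so no overflow here).
n_small <- 94531 * 18946          # 1,790,984,326 cells
n_big   <- 112039 * 22752         # 2,549,111,328 cells
int_max <- .Machine$integer.max   # 2,147,483,647

n_small > int_max   # FALSE: the NA count can never overflow
n_big > int_max     # TRUE: the NA count can overflow if most cells are NA

# Summing integers (or logicals) past the limit returns NA and raises the
# "integer overflow" warning; the warning is suppressed here.
overflowed <- suppressWarnings(sum(c(int_max, 1L)))
is.na(overflowed)   # TRUE

# An NA condition inside if() raises the error reported in the question.
msg <- tryCatch(
  if (!overflowed) "nothing to replace",
  error = function(e) conditionMessage(e)
)
msg
```

This would also explain why the smaller table works: with fewer than `.Machine$integer.max` cells in total, the count of NAs cannot overflow.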

In this application, you can get your desired outcome in two steps.

  1. Find the locations of NAs for each variable and return a list.
  2. Use data.table's set function to replace the values.

I constructed a data.table as a reproducible example.

library(data.table)

set.seed(1234)
dt <- data.table(matrix(sample(c(NA, rnorm(4)), replace=TRUE, 50), 10))

This looks like

dt
            V1         V2         V3         V4         V5
 1:  1.0844412         NA -2.3456977 -2.3456977 -1.2070657
 2:  0.2774292 -1.2070657         NA -2.3456977  1.0844412
 3:  1.0844412 -1.2070657  0.2774292  0.2774292         NA
 4:  0.2774292 -1.2070657 -1.2070657  1.0844412 -1.2070657
 5: -1.2070657         NA -1.2070657 -1.2070657  1.0844412
 6: -2.3456977         NA  0.2774292  1.0844412  0.2774292
 7: -1.2070657 -1.2070657         NA -1.2070657         NA
 8: -2.3456977 -2.3456977  1.0844412  0.2774292  0.2774292
 9: -1.2070657  0.2774292 -1.2070657  1.0844412  0.2774292
10: -1.2070657 -2.3456977 -1.2070657  0.2774292  1.0844412

The first step is to find the NAs for each column.

myNAs <- lapply(dt, function(x) which(is.na(x)))

Next, use a for loop to iterate over the columns and fill in the NA values with the super efficient set function after checking that the column contains missing values with if.

for(j in seq_along(dt)) if(length(myNAs[[j]]) > 0) set(dt, myNAs[[j]], j, 0)

set performs the replacement "in place" (without any copies), so following this operation, the data.table dt has the former NAs replaced with 0s.

dt
            V1         V2         V3         V4         V5
 1:  1.0844412  0.0000000 -2.3456977 -2.3456977 -1.2070657
 2:  0.2774292 -1.2070657  0.0000000 -2.3456977  1.0844412
 3:  1.0844412 -1.2070657  0.2774292  0.2774292  0.0000000
 4:  0.2774292 -1.2070657 -1.2070657  1.0844412 -1.2070657
 5: -1.2070657  0.0000000 -1.2070657 -1.2070657  1.0844412
 6: -2.3456977  0.0000000  0.2774292  1.0844412  0.2774292
 7: -1.2070657 -1.2070657  0.0000000 -1.2070657  0.0000000
 8: -2.3456977 -2.3456977  1.0844412  0.2774292  0.2774292
 9: -1.2070657  0.2774292 -1.2070657  1.0844412  0.2774292
10: -1.2070657 -2.3456977 -1.2070657  0.2774292  1.0844412
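
On the second question: as far as I know, `rbindlist` has no argument for choosing the fill value, so one option is to bind with `fill=TRUE` and then replace the NAs with `set`. The sketch below is a variant of the loop above that finds and replaces the NAs one column at a time, so no full `myNAs` list is kept in memory. The names `parts` and `dt2` and their values are made up for illustration.

```r
library(data.table)

# Hypothetical pieces with partly different columns, standing in for the real list.
parts <- list(
  data.table(a = 1, b = 0),
  data.table(b = 1, c = 0)
)

# fill = TRUE pads the columns missing from each piece with NA.
dt2 <- rbindlist(parts, fill = TRUE)

# Locate and replace the NAs column by column; only one index vector
# exists at a time, and set() updates dt2 by reference.
for (j in seq_along(dt2)) {
  set(dt2, i = which(is.na(dt2[[j]])), j = j, value = 0)
}

dt2

# Quick check that no NA is left.
anyNA(dt2, recursive = TRUE)
```

Columns without any NA are passed an empty index vector, so nothing is assigned for them. Newer data.table releases also provide `setnafill(dt2, fill = 0)` for numeric columns, which may be worth trying if it is available in your version.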
lmo
  • My `myNAs` list is 9.3 GB, so I was afraid it would take a lot of time, but it was kinda fast!!! So I hope it's really doing what I want, because checking this huge data.table is a nightmare. The `lapply` function is quite slow, so I tried `mclapply`, but it didn't return a list of 18946 elements (only about 7000). So I'll stay with `lapply`. – Martina Zapletalová Aug 24 '17 at 09:47
  • The `set` function is the fastest method to modify variables that I know of, though it is limited in what it can do. To check if any value is NA across your data.table after applying `set`, you could use `anyNA(dt, recursive=TRUE)` or `any(unlist(dt[, lapply(.SD, anyNA)]))`. I'm pretty sure the first method will be faster. I'm not sure why the `mclapply` failed, but there are certainly related posts on SO if you are interested in further exploration. – lmo Aug 24 '17 at 11:38