5

I have a big data.frame called "mat" of 49952 obs. of 7597 variables and I'm trying to replace NAs with zeros. Here is and example how my data.frame looks like:

    A   B   C   E   F   D   Q   Z   . . .
1   1   1   0   NA  NA  0   NA  NA
2   0   0   1   NA  NA  0   NA  NA
3   0   0   0   NA  NA  1   NA  NA
4   NA  NA  NA  NA  NA  NA  NA  NA
5   0   1   0   1   NA  0   NA  NA 
6   1   1   1   0   NA  0   NA  NA
7   0   0   1   0   NA  1   NA  NA 
.
.
.

I need realy fast tool to replace them. The result should look like:

    A   B   C   E   F   D   Q   Z   . . .
1   1   1   0   0   0   0   0   0
2   0   0   1   0   0   0   0   0 
3   0   0   0   0   0   1   0   0
4   0   0   0   0   0   0   0   0
5   0   1   0   1   0   0   0   0 
6   1   1   1   0   0   0   0   0
7   0   0   1   0   0   1   0   0 
.
.
.

I already tried lapply(mat, function(x){replace(x, is.na(x),0)}) - didn't work - mat[is.na(mat)] <- 0 - error and and maybe too slow - and also link - didn't work too.

@Sotos already advised me plyr::rbind.fill(lapply(L, as.data.frame)) but it didn't work, because it makes data.frame of 379485344 observations and 1 variable (which is 49952x7597) so I have to also trafnsform it back. Is there any better way to do this?

The real structure of my data.frame:

> str(mat)
'data.frame':   49952 obs. of  7597 variables:
 $ 6794602   : num  1 NA NA NA NA 0 0 0 0 0 ...
 $ 1008667   : num  NA 1 0 NA NA 0 0 0 0 0 ...
 $ 8009082   : num  NA 0 1 NA NA NA NA NA NA NA ...
 $ 6740421   : num  NA NA NA 1 NA 0 0 0 0 0 ...
 $ 6777805   : num  NA NA NA NA 1 NA NA NA NA NA ...
 $ 1001682   : num  NA NA NA NA NA 0 0 0 0 0 ...
 $ 1001990   : num  NA NA NA NA NA 0 0 0 0 0 ...
 $ 1002541   : num  NA NA NA NA NA 0 0 0 0 0 ...
 $ 1002790   : num  NA NA NA NA NA 0 0 0 0 0 ...

Note:

when I tried mat[is.na(mat)] <- 0 there was a warning:

> mat[is.na(mat)] <- 0
Warning messages:
1: In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
  invalid factor level, NA generated
> nlevels(mat)
[1] 0

Data.frame mat after using mat[is.na(mat)] <- 0:

> str(mat)
'data.frame':   49952 obs. of  7597 variables:
 $ 6794602   : num  1 0 0 0 0 0 0 0 0 0 ...
 $ 1008667   : num  0 1 0 0 0 0 0 0 0 0 ...
 $ 8009082   : num  0 0 1 0 0 0 0 0 0 0 ...
 $ 6740421   : num  0 0 0 1 0 0 0 0 0 0 ...
 $ 6777805   : num  0 0 0 0 1 0 0 0 0 0 ...
 $ 1001682   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1001990   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002541   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ 1002790   : num  0 0 0 0 0 0 0 0 0 0 ...

So the questions are:

  1. Is there any other fast way to replace the NA?
  2. Is the warning big deal? Because data after using mat[is.na(mat)] <- 0 looks like what I want, but there are too many values, so I can't check if they are all right.
  • `mat[is.na(mat)] = 0` should be the fastest way, hands down (on dense matrices). If it isn’t, that’s a glaring bug in R … – Konrad Rudolph Aug 08 '17 at 17:08
  • 1
    That's a warning not an error, and it explains well what is happening, right? If you're surprised that the data has factors, maybe try `View(mat[sapply(mat, is.factor)])` or maybe `str` instead of `View` there. – Frank Aug 08 '17 at 17:11
  • @Frank Well the question contains the output of `str(mat)` and there are no factors. But the warning message simply doesn’t fit that output. – Konrad Rudolph Aug 08 '17 at 17:19
  • @Konrad OP truncated the str output. Try `str(as.data.frame(replicate(7597, 1, simplify=FALSE)))` -- first, OP showed us less than they saw; second, even the full displayed output won't show all 7597 columns. Anyway, we cannot say for sure when OP only provides glimpses of their data instead of a good example... – Frank Aug 08 '17 at 17:40
  • @Frank @KonradRudolph I think there shouldn't be any factor I make this data.frame from: `'data.frame': 199235 obs. of 3 variables: $ Invoice_Date: Factor w/ 627 levels $ SKU : Factor w/ 53113 levels $ CustomerID : Factor w/ 55945 levels` where I split it into 627 data frames accoring to Invoice_Date and use droplevels to simpler computation and then I made frequency data frames of SKU in columns and CustomerID in rows and then I use `mat <- rbindlist(cop.data1, fill=T)` to put it back together (I don't need CusotmerID) and I get the data.frame **mat** – Martina Zapletalová Aug 08 '17 at 20:55
  • Edit: I found out two factors! But it was like what, why?! – Martina Zapletalová Aug 08 '17 at 21:31

3 Answers3

10

Try the following:

mat %>% replace(is.na(.), 0)
Sagar
  • 2,778
  • 1
  • 8
  • 16
  • It takes longer than `mat[is.na(mat)] <- 0` but maybe I will let it run overnight to find out if there will be also the warning. Edit: it takes longer and again - warning. So as I wrote above, I think I will ignore the warning. – Martina Zapletalová Aug 08 '17 at 21:04
  • Can you check the following issue to understand that warning message? Might help resolve. https://stackoverflow.com/a/16820025/8382207 – Sagar Aug 09 '17 at 13:13
2

If suspect that some of your columns are factor, you can use the following code to detect and change them to numeric.

inx <- sapply(mat, inherits, "factor")
mat[inx] <- lapply(mat[inx], function(x) as.numeric(as.character(x)))

Then try the following.

mat[] <- lapply(mat, function(x) {x[is.na(x)] <- 0; x})
mat

And here's the data.

mat <-
structure(list(A = c(1L, 0L, 0L, NA, 0L, 1L, 0L), B = c(1L, 0L, 
0L, NA, 1L, 1L, 0L), C = c(0L, 1L, 0L, NA, 0L, 1L, 1L), E = c(NA, 
NA, NA, NA, 1L, 0L, 0L), F = c(NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_), D = c(0L, 0L, 1L, NA, 
0L, 0L, 1L), Q = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_), Z = c(NA_real_, NA_real_, NA_real_, NA_real_, 
NA_real_, NA_real_, NA_real_)), .Names = c("A", "B", "C", "E", 
"F", "D", "Q", "Z"), row.names = c("1", "2", "3", "4", "5", "6", 
"7"), class = "data.frame")
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • again warning messages: `Warning messages: 1: In `[<-.factor`(`*tmp*`, is.na(x), value = 0) : invalid factor level, NA generated 2: In `[<-.factor`(`*tmp*`, is.na(x), value = 0) : invalid factor level, NA generated` maybe I will just ignore warning, it's not the error as in other cases. – Martina Zapletalová Aug 08 '17 at 20:59
  • @MartinaZapletalová If your output of `str(mat)` is correct that shouldn't happen, since all columns are of class `numeric`. Are you sure that all of those 7597 variables are `numeric`? If two of them are factors you may need to convert them to `numeric` first. – Rui Barradas Aug 08 '17 at 21:04
  • so I tried this: `a <- 0 for (i in 1:ncol(mat)){ if (class(mat[[i]]) == "numeric"){ a <- a+1 } else { a <- a+1 print(i) } }` and found out that there is problem in mat[[2260]] and mat[[2261]] so I looked at it and you were right they are factor but I don't understand how it could happend. – Martina Zapletalová Aug 08 '17 at 21:27
  • @MartinaZapletalová OK, I'll edit my answer with code for you to get rid of the factors in an automated way. – Rui Barradas Aug 09 '17 at 09:41
0

See my detailed answer here.

#install.packages("xlsx")
library(xlsx)
extracted_df <- read.xlsx("test.xlsx", sheetName='Sheet1', stringsAsFactors=FALSE)
# Replace all NAs in a data frame with "G" character
extracted_df[is.na(extracted_df)] <- "G"
ozturkib
  • 1,493
  • 16
  • 28