124

In R, I have an operation which creates some Inf values when I transform a dataframe.

I would like to turn these Inf values into NA values. The code I have is slow for large data. Is there a faster way of doing this?

Say I have the following dataframe:

dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))

The following works in a single case:

 dat[,1][is.infinite(dat[,1])] = NA

So I generalized it with the following loop:

cf_DFinf2NA <- function(x)
{
    for (i in 1:ncol(x)){
          x[,i][is.infinite(x[,i])] = NA
    }
    return(x)
}

But I don't think that this is really using the power of R.

Eric Leschinski
ricardo

12 Answers

134

Option 1

Use the fact that a data.frame is a list of columns, then use do.call to recreate a data.frame.

do.call(data.frame, lapply(dat, function(x) replace(x, is.infinite(x), NA)))
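A minimal sketch of Option 1 on the small data frame from the question. The helper name inf_to_na is introduced here only for illustration. The second form assigns back into the existing columns, as suggested in the comments on this answer.

```r
# Small example data from the question
dat <- data.frame(a = c(1, Inf), b = c(Inf, 3), d = c("a", "b"))

# Hypothetical helper: replace infinite entries of one column with NA
inf_to_na <- function(x) replace(x, is.infinite(x), NA)

# A data frame is a list of columns, so lapply() visits each column
# and do.call(data.frame, ...) rebuilds a data frame from the result
res <- do.call(data.frame, lapply(dat, inf_to_na))

# Alternative: keep the existing data frame and overwrite its columns
dat2 <- dat
dat2[] <- lapply(dat2, inf_to_na)
```

Both forms leave the non-numeric column d untouched, because is.infinite() is FALSE for every element of a character vector.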

Option 2 -- data.table

You could use data.table and set. This avoids some internal copying.

library(data.table)
DT <- data.table(dat)
invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA)))

Or using column numbers (possibly faster if there are a lot of columns):

for (j in seq_len(ncol(DT))) set(DT, which(is.infinite(DT[[j]])), j, NA)
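As a small, hedged illustration of the by-reference behaviour, the sketch below runs set() on a tiny table. Restricting the loop to numeric columns and the name num_cols are assumptions made for this example, not part of the original answer.

```r
library(data.table)

DT <- data.table(a = c(1, Inf), b = c(Inf, 3), d = c("a", "b"))

# Assumption for this sketch: only numeric columns can hold Inf,
# so the character column is skipped up front
num_cols <- names(DT)[vapply(DT, is.numeric, logical(1))]

# set() modifies DT in place; no copy of the table is made
for (j in num_cols) {
  set(DT, i = which(is.infinite(DT[[j]])), j = j, value = NA_real_)
}
```

Because set() works by reference, DT itself is changed and nothing needs to be assigned back.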

Timings

# some `big(ish)` data
dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6), 
                  c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),  
                  e = rep(c(Inf,2), 1e6))
# create data.table
library(data.table)
DT <- data.table(dat)

# replace (@mnel)
system.time(na_dat <- do.call(data.frame,lapply(dat, function(x) replace(x, is.infinite(x),NA))))
## user  system elapsed 
#  0.52    0.01    0.53 

# is.na (@dwin)
system.time(is.na(dat) <- sapply(dat, is.infinite))
# user  system elapsed 
# 32.96    0.07   33.12 

# modified is.na
system.time(is.na(dat) <- do.call(cbind,lapply(dat, is.infinite)))
#  user  system elapsed 
# 1.22    0.38    1.60 


# data.table (@mnel)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
# user  system elapsed 
# 0.29    0.02    0.31 

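data.table is the quickest (0.31 s elapsed), followed by replace with do.call (0.53 s) and the modified is.na (1.60 s). Using sapply slows things down noticeably (33.12 s).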

altocumulus
mnel
  • 1
    Great work on the timings and the modification @mnel. I wish there were an SO way to transfer rep across accounts. I think I will go out and upvote some other answers of yours. – IRTFM Aug 30 '12 at 21:09
  • error in do.call(train, lapply(train, function(x) replace(x, is.infinite(x), : 'what' must be a character string or a function – Hack-R Feb 26 '16 at 15:59
  • If you are happy to replace in situ then the following simplification will work: dat[] <- lapply(dat, function(x) replace(x, is.infinite(x),NA)) – Knackiedoo Mar 05 '21 at 23:40
63

Use sapply and is.na<-

> dat <- data.frame(a=c(1, Inf), b=c(Inf, 3), d=c("a","b"))
> is.na(dat) <- sapply(dat, is.infinite)
> dat
   a  b d
1  1 NA a
2 NA  3 b

Or you can use (giving credit to @mnel, whose edit this is),

> is.na(dat) <- do.call(cbind,lapply(dat, is.infinite))

which is significantly faster.
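To make the mechanism visible, here is a small sketch that builds the logical matrix first and then passes it to is.na<-. The variable name mask is chosen for this illustration only.

```r
dat <- data.frame(a = c(1, Inf), b = c(Inf, 3), d = c("a", "b"))

# One logical vector per column, bound together into a matrix
# with the same shape as the data frame
mask <- do.call(cbind, lapply(dat, is.infinite))

# is.na<- sets every cell where mask is TRUE to NA
is.na(dat) <- mask
```

The character column contributes a column of FALSE values to mask, so it is left unchanged.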

IRTFM
27

Here is a dplyr/tidyverse solution using the na_if() function:

dat %>% mutate_if(is.numeric, list(~na_if(., Inf)))

Note that this only replaces positive infinity with NA. Repeat the call with -Inf if negative infinity values also need to be replaced:

dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>% 
  mutate_if(is.numeric, list(~na_if(., -Inf)))
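A possible single-pass variant, sketched with across() (available in dplyr 1.0.0 and later). It swaps na_if() for is.infinite(), which is TRUE for both Inf and -Inf. The example data is made up for this illustration.

```r
library(dplyr)

dat <- data.frame(a = c(1, Inf, -Inf), b = c(-Inf, 3, 5), d = c("a", "b", "c"))

# is.infinite() catches both signs, so one mutate() is enough
out <- dat %>%
  mutate(across(where(is.numeric), ~ replace(.x, is.infinite(.x), NA)))
```

Only the numeric columns are touched; column d passes through unchanged.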
Feng Mai
  • 3
    With the new `across` function, this can now be done in a single `mutate` call: `mutate(across(where(is.numeric), ~na_if(., Inf)), across(where(is.numeric), ~na_if(., -Inf)))` – Paul Wildenhain Jan 27 '22 at 19:07
19

[<- with mapply is a bit faster than sapply.

> dat[mapply(is.infinite, dat)] <- NA

With mnel's data, the timing is

> system.time(dat[mapply(is.infinite, dat)] <- NA)
#   user  system elapsed 
# 15.281   0.000  13.750 
Rich Scriven
7

There is a very simple solution to this problem in the hablar package:

library(hablar)

dat %>% rationalize()

This returns a data frame in which all Inf values are converted to NA.

Timings compared to some of the solutions above. Code:

library(hablar)
library(data.table)

dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6), 
                  c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),  
                  e = rep(c(Inf,2), 1e6))
DT <- data.table(dat)

system.time(dat[mapply(is.infinite, dat)] <- NA)
system.time(dat[dat==Inf] <- NA)
system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
system.time(rationalize(dat))

Result:

> system.time(dat[mapply(is.infinite, dat)] <- NA)
   user  system elapsed 
  0.125   0.039   0.164 
> system.time(dat[dat==Inf] <- NA)
   user  system elapsed 
  0.095   0.010   0.108 
> system.time(invisible(lapply(names(DT),function(.name) set(DT, which(is.infinite(DT[[.name]])), j = .name,value =NA))))
   user  system elapsed 
  0.065   0.002   0.067 
> system.time(rationalize(dat))
   user  system elapsed 
  0.058   0.014   0.072 
> 

It seems data.table is slightly faster than hablar, but it has a longer syntax.

davsjob
3

Feng Mai has a tidyverse answer above to get negative and positive infinities:

dat %>% mutate_if(is.numeric, list(~na_if(., Inf))) %>% 
  mutate_if(is.numeric, list(~na_if(., -Inf)))

This works well, but a word of warning: do not swap in abs(.) here to do both lines at once, as is proposed in an upvoted comment. It will look like it works, but it changes all negative values in the dataset to positive! You can confirm with this:

data(iris)
#The last line here is bad - it converts all negative values to positive
iris %>% 
  mutate_if(is.numeric, ~scale(.)) %>%
  mutate(infinities = Sepal.Length / 0) %>%
  mutate_if(is.numeric, list(~na_if(abs(.), Inf)))

For one line, this works:

dat %>%
  mutate_if(is.numeric, ~ifelse(abs(.) == Inf, NA, .))
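The difference is easy to see on a plain vector. This is a small sketch with made-up values: na_if(abs(x), Inf) returns the absolute values, while the ifelse() form only uses abs() in the test and returns the original values.

```r
library(dplyr)

x <- c(-2, 1, Inf, -Inf)

# abs() is applied to the returned values, so -2 becomes 2
bad <- na_if(abs(x), Inf)
# bad is 2 1 NA NA

# abs() is used only in the condition; the original sign is kept
good <- ifelse(abs(x) == Inf, NA, x)
# good is -2 1 NA NA
```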
Mark E.
  • 1
    Good catch! I've added a comment to this effect on the original comment--I think that's a better place to address the issue than a new answer. Also found some posts of yours worthy of upvotes to get you a little closer to the 50 reputation required to comment anywhere. – Gregor Thomas Apr 28 '20 at 16:49
  • Thanks! Yes I would have left a comment if I'd been able. – Mark E. Apr 28 '20 at 21:30
  • Do you know why it doesn't work with if_else instead of ifelse in the last code? – Ian.T Dec 02 '20 at 16:25
2

Inside a dplyr pipe chain, you can do this.

%>% mutate_all(.,.funs = function(x){ifelse(is.infinite(x),NA,x)}) %>%

I find it simple, elegant and fast.

Sagar
2

There are many answers already, but I would like to add that this tidyverse solution has always worked well for me:

%>% mutate_all(function(x) ifelse(is.nan(x) | is.infinite(x), NA, x)) %>%
ToWii
0

Another solution:

    dat <- data.frame(a = rep(c(1,Inf), 1e6), b = rep(c(Inf,2), 1e6), 
                      c = rep(c('a','b'),1e6),d = rep(c(1,Inf), 1e6),  
                      e = rep(c(Inf,2), 1e6))
    system.time(dat[dat==Inf] <- NA)

#   user  system elapsed
#  0.316   0.024   0.340
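One caveat worth checking before using the comparison with Inf: it matches only positive infinity. The sketch below, on made-up data, contrasts it with an is.infinite() mask, which covers both signs.

```r
x <- data.frame(a = c(1, Inf, -Inf), b = c(-Inf, 2, Inf))

# Comparison with Inf: -Inf is not equal to Inf, so it survives
y <- x
y[y == Inf] <- NA

# is.infinite() is TRUE for Inf and -Inf alike
z <- x
z[sapply(z, is.infinite)] <- NA
```

If the data can contain -Inf, the condition would need to be extended, for example to y == Inf | y == -Inf.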
Mus
Student
  • MusTheDataGuy, why would you edit my answer but not add your own solution? There is already "add another answer" button! – Student Oct 10 '18 at 16:32
0

Also, if someone needs the coordinates of the Inf values, they can do this:

library(rlist)
# lapply() keeps each column's type; apply() would first coerce a mixed data frame to character
list.clean(lapply(df, function(x) which(is.infinite(x))), function(x) length(x) == 0L, TRUE)

Result:

$colname1
[1] row1 row2 ...
$colname2
[1] row1 row2 ... 

With this information, you can replace the Inf values in particular places with the mean, median, or whatever operator that you want.

For example (for the first element):

repInf = list.clean(lapply(df, function(x) which(is.infinite(x))), function(x) length(x) == 0L, TRUE)
col = names(repInf)[[1]]
# mean of the finite values only; swap in median() if preferred
df[repInf[[1]], col] = mean(df[[col]][is.finite(df[[col]])])

In a loop:

for (nonInf in seq_along(repInf)) {
  col = names(repInf)[[nonInf]]
  df[repInf[[nonInf]], col] = mean(df[[col]][is.finite(df[[col]])])
}
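The same idea can be sketched in base R without rlist, under the assumption that each infinite value should be replaced by the mean of the finite values in its own column. The data and variable names are made up for this illustration.

```r
df <- data.frame(x = c(1, Inf, 3), y = c(-Inf, 2, 4))

for (col in names(df)) {
  # skip non-numeric columns, which cannot hold Inf
  if (!is.numeric(df[[col]])) next
  bad <- is.infinite(df[[col]])
  if (any(bad)) {
    # mean of the finite values; is.finite() also excludes NA and NaN
    df[[col]][bad] <- mean(df[[col]][is.finite(df[[col]])])
  }
}
```

Replacing mean() with median() gives the median-based variant.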
0

Chiming in, thought this worked well.

infNanReplace <- function (v, r = 0) {
  v[!is.finite(v)] <- r
  return(v)
}
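A short usage sketch for this helper, assuming it is applied only to the numeric columns of a data frame. Note that !is.finite() is also TRUE for NA and NaN, so those are replaced too, and that the default replacement is 0 unless r = NA is passed. The example data is made up.

```r
infNanReplace <- function(v, r = 0) {
  v[!is.finite(v)] <- r
  return(v)
}

dat <- data.frame(a = c(1, Inf, NaN), b = c(-Inf, 3, NA), d = c("a", "b", "c"))

# Apply only to numeric columns; on a character column
# is.finite() is FALSE everywhere, so every value would be overwritten
num <- vapply(dat, is.numeric, logical(1))
dat[num] <- lapply(dat[num], infNanReplace, r = NA)
```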
robbieNukes
-1

You may also use the handy replace_na function: https://tidyr.tidyverse.org/reference/replace_na.html

Gang Su
  • 2
    This is a borderline [link-only answer](//meta.stackexchange.com/q/8231). You should expand your answer to include as much information here, and use the link only for reference. – Blue Nov 17 '18 at 01:04