
I quite often come across data that is structured something like this:

employees <- list(
    list(id = 1,
         dept = "IT",
         age = 29,
         sportsteam = "softball"),
    list(id = 2,
         dept = "IT",
         age = 30,
         sportsteam = NULL),
    list(id = 3,
         dept = "IT",
         age = 29,
         sportsteam = "hockey"),
    list(id = 4,
         dept = NULL,
         age = 29,
         sportsteam = "softball"))

In many cases such lists can be tens of millions of items long, so memory use and efficiency are always a concern.

I would like to turn the list into a dataframe but if I run:

library(data.table)
employee.df <- rbindlist(employees)

I get errors because of the NULL values. My normal strategy is to use a function like:

nullToNA <- function(x) {
    # replace each NULL element of the list with NA
    x[sapply(x, is.null)] <- NA
    return(x)
}

and then:

employees <- lapply(employees, nullToNA)
employee.df <- rbindlist(employees)

which returns

   id dept age sportsteam
1:  1   IT  29   softball
2:  2   IT  30         NA
3:  3   IT  29     hockey
4:  4   NA  29   softball

However, the nullToNA function is very slow when applied to 10 million cases, so it would be good if there were a more efficient approach.

One thing that seems to slow the process down is that is.null() can only be applied to one item at a time (unlike is.na(), which can scan a full list in one go).
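
For illustration, base R (>= 3.2.0) does offer a vectorized test: NULL elements have length zero, so lengths() can flag them all in one call. A sketch, assuming the only zero-length fields in the data are NULLs:

nullToNA_vec <- function(x) {
    # lengths() returns the length of every list element at once,
    # so zero-length (NULL) elements can be replaced in a single step
    x[lengths(x) == 0] <- NA
    x
}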

Any advice on how to do this operation efficiently on a large dataset?

Jon M
  • Have you tried do.call with rbind? Like so: `employee.df <- do.call("rbind", employees)` – infominer Apr 04 '14 at 18:24
  • Does the original data actually say "NULL" if null, or is it just empty there? – Rich Scriven Apr 04 '14 at 18:24
  • The original data has NULL values. It was generated by scraping JSON files and converting them through rjson. – Jon M Apr 04 '14 at 18:32
  • In a general case, if the original (scraped) dataset is already a data.frame (e.g., from `XML::readHTMLTable()`), and the NULL cells are simply 0-length character strings, use the following: `df <- data.frame(apply(df, c(1,2), FUN=function(x) ifelse(x=="",NA,x)))` – Brian D Apr 18 '18 at 18:40

7 Answers


Many efficiency problems in R are solved by first changing the original data into a form that makes the processes that follow as fast and easy as possible. Usually, this is matrix form.

If you bring all the data together with rbind, your nullToNA function no longer has to search through nested lists, and therefore sapply serves its purpose (looking through a matrix) more efficiently. In theory, this should make the process faster.

Good question, by the way.

> dat <- do.call(rbind, lapply(employees, rbind))
> dat
     id dept age sportsteam
[1,] 1  "IT" 29  "softball"
[2,] 2  "IT" 30  NULL      
[3,] 3  "IT" 29  "hockey"  
[4,] 4  NULL 29  "softball"

> nullToNA(dat)
     id dept age sportsteam
[1,] 1  "IT" 29  "softball"
[2,] 2  "IT" 30  NA        
[3,] 3  "IT" 29  "hockey"  
[4,] 4  NA   29  "softball"
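
A caveat raised in the comments below: after converting this matrix with data.frame(), the columns are still lists. A minimal sketch of one fix, assuming the NULLs have already been replaced with NA so that every cell has length one:

dat <- as.data.frame(nullToNA(dat))
dat[] <- lapply(dat, unlist)  # collapse each list column into an atomic vector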
Rich Scriven
  • This is certainly faster than my current approach. It's a shame to lose the speed of rbindlist for the concatenation though, as that is a lot faster than do.call(rbind, employees). That might just be the tradeoff here though. – Jon M Apr 04 '14 at 18:30
  • In that case why don't you do this: `lapply(employees, function(x) ifelse(x == "NULL", NA, x))` and then use `rbindlist`? – infominer Apr 04 '14 at 18:32
  • This would be very neat, but the resulting columns (after conversion to data.frame) are lists, which creates problems. See `dat = data.frame(dat); dat[,1]`. – geotheory Jul 12 '16 at 22:38
  • Everyone, this is a really bad answer. Please don't use it. @JonM - if you could, please remove the checkmark so I can delete it. – Rich Scriven Mar 13 '19 at 20:15

A tidyverse solution that I find easier to read is to write a function that works on a single element and map it over all of your NULLs.

I'll use @rich-scriven's rbind and lapply approach to create a matrix, and then turn that into a dataframe.

library(magrittr)

dat <- do.call(rbind, lapply(employees, rbind)) %>% 
  as.data.frame()

dat
#>   id dept age sportsteam
#> 1  1   IT  29   softball
#> 2  2   IT  30       NULL
#> 3  3   IT  29     hockey
#> 4  4 NULL  29   softball

Then we can use purrr::modify_depth() at a depth of 2 to apply replace_x():

replace_x <- function(x, replacement = NA_character_) {
  # a zero-length cell (or a cell wrapping a zero-length value) was a NULL
  if (length(x) == 0 || length(x[[1]]) == 0) {
    replacement
  } else {
    x
  }
}

out <- dat %>% 
  purrr::modify_depth(2, replace_x)

out
#>   id dept age sportsteam
#> 1  1   IT  29   softball
#> 2  2   IT  30         NA
#> 3  3   IT  29     hockey
#> 4  4   NA  29   softball
amanda

A two-step approach: first combine the list with rbind, then create a data frame from it:

employee.df <- data.frame(do.call("rbind", employees))

Now replace the NULLs. I compare against the string "NULL" because R doesn't keep a true NULL when the data is loaded this way; the value comes through as the character string "NULL".

employee.df.withNA <- sapply(employee.df, function(x) ifelse(x == "NULL", NA, x))
infominer

I often find do.call() constructs hard to read. A solution I use daily (with a MySQL output containing "NULL" character values):

NULL2NA <- function(df) {
  # replace every cell holding the character string 'NULL' with a real NA
  df[, 1:length(df)][df[, 1:length(df)] == 'NULL'] <- NA
  return(df)
}
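
A hypothetical usage example (the small data frame here is invented for illustration):

df <- data.frame(id = c("1", "2"),
                 dept = c("IT", "NULL"),
                 stringsAsFactors = FALSE)
NULL2NA(df)
#   id dept
# 1  1   IT
# 2  2 <NA>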

But for all solutions, remember: NA propagates through calculations unless you use na.rm = TRUE, whereas NULL values are simply dropped by c(). NaN causes the same problem as NA. For example:

> mean(c(1, 2, 3))
[1] 2

> mean(c(1, 2, NA, 3))
[1] NA

> mean(c(1, 2, NULL, 3))
[1] 2

> mean(c(1, 2, NaN, 3))
[1] NaN
MS Berends

All of these solutions (I think) hide the fact that the data table still holds a list of lists rather than a list of vectors (I did not notice it in my application either until it started throwing unexpected errors during :=). Try this:

data.table(t(sapply(employees, function(emp)
  unlist(lapply(emp, function(field) ifelse(is.null(field), NA, field))))))

I believe it works fine, but I am not sure whether it suffers from slowness or could be optimized further.

BBB

Another option would be to simply map_dfr() (from purrr) over the list, which immediately yields the correct result:

> library(purrr)
> map_dfr(employees, ~ .x)
# A tibble: 4 × 4
     id dept    age sportsteam
  <dbl> <chr> <dbl> <chr>     
1     1 IT       29 softball  
2     2 IT       30 NA        
3     3 IT       29 hockey    
4     4 NA       29 softball  

However, if a column has no non-NULL values, it will be omitted from the output:

> list(list(a = 1, b = NULL, c = 3), list(a = 4, b = NULL, c = 6)) |> 
+   map_dfr(~ .x)
# A tibble: 2 × 2
      a     c
  <dbl> <dbl>
1     1     3
2     4     6
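
One possible workaround (a sketch) is to replace the NULLs with NA before binding, so every column survives:

library(purrr)

list(list(a = 1, b = NULL, c = 3), list(a = 4, b = NULL, c = 6)) |> 
  map(\(row) map(row, \(field) if (is.null(field)) NA else field)) |> 
  map_dfr(~ .x)
# expected: a 2 x 3 tibble with columns a, b (all NA), and c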
Dmitry Zotikov

Instead of sapply(x, is.null), match(list(NULL), x) can be used to speed up the conversion from NULL to NA in a list.

lapply(employees, \(x) `[<-`(x, match(list(NULL), x), NA))

Benchmark

bench::mark(
  sapply = lapply(employees, \(x) `[<-`(x, sapply(x, is.null), NA)),
  match  = lapply(employees, \(x) `[<-`(x, match(list(NULL), x), NA))
)
#  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 sapply       82.6µs 88.8µs    11142.    4.13KB    47.1   4971    21      446ms
#2 match        50.7µs   57µs    17272.    4.13KB     6.47  8006     3      464ms
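
The benchmark above runs on the four-row example list; to gauge the gap at the scale mentioned in the question, the same comparison can be rerun on an inflated copy (a sketch; relative timings at scale are not guaranteed to match the small-sample numbers):

employees_big <- rep(employees, 1e5)  # 400,000 list elements
bench::mark(
  sapply = lapply(employees_big, \(x) `[<-`(x, sapply(x, is.null), NA)),
  match  = lapply(employees_big, \(x) `[<-`(x, match(list(NULL), x), NA))
)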
GKi