Turn a list of lists with unnamed entries into a data frame or a tibble

Question

I'm using the reticulate R package from RStudio to run some Python code that brings data from ROOT (http://root.cern.ch) into R. My problem is that the Python code returns a list of row-wise lists. For example, in Python:

[[0L, 0L, 'mu+', 1, 0, 0, 1, 3231.6421853545253, -17.361063509909364, 6322.884067996471, -2751.857298366544, 1.2318766603937736, 1407.9560948453036, 3092.931322317615], 
[0L, 0L, 'nu_e', 3, 1, 0, 0, 3231.6421853545253, -17.361063509909364, 6322.884067996471, -743.6755000649275, 9.950229845741603, 342.4203222294634, 818.781981693865], 
[0L, 0L, 'anti_nu_mu', 2, 1, 0, 0, 3231.6421853545253, -17.361063509909364, 6322.884067996471, -808.1114666690765, 21.680955968349267, 445.2784282520303, 922.9231198102832],
...]

These data get turned into a corresponding list of lists in R by reticulate,

List of 136972
$ :List of 14
..$ : int 0
..$ : int 0
..$ : chr "mu+"
..$ : int 1
..$ : int 0
..$ : int 0
..$ : int 0
..$ : num 7162
..$ : num -0.0108
..$ : num -627
..$ : num 264
..$ : num -3.24
..$ : num 3080
..$ : num 3093
$ :List of 14
..$ : int 0
..$ : int 0
..$ : chr "mu+"
..$ : int 1
.... (you get the idea)

I've searched everywhere I can think of, and I cannot find a way to turn these data into a data frame (I really want a tibble). One problem seems to be that the list entries are not named. There's a lot of data, so I don't want to do something inefficient. I could have the Python code return a dictionary of columns, and that would work, but the Python code to build a row is much simpler.

If there were an easy way to turn these row-wise lists into a data frame, that would be ideal. Any ideas?

Maybe something like `as.data.frame(lapply(1:14, function(x) sapply(LL, function(y) y[[x]])), col.names = paste0("V", 1:14))` (where `LL` is your `list` of 136972 values). — A5C1D2H2I1M1N2O1R2T1, Mar 07 '17 at 08:47
Thanks! That works and isn't too slow. I had tried `df <- as.data.frame(do.call(rbind, myList), col.names=colnames)`, but then I end up with a data frame where the columns are lists. Is there a way to make that work? — Adam, Mar 07 '17 at 14:12

score 2 · Accepted Answer · answered Mar 10 '17 at 07:48

Here are a couple of approaches that came to mind:

Option 1: We know how many items are in the sub-lists (how many columns are expected). Cycle through the list to make a new list with each relevant element from the sub-lists. Wrap that in as.data.frame and you're done.

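myFun_1 <- function(inlist, expectedCols = 14) {
  as.data.frame(
    # Build one vector per column by pulling element x out of every row
    lapply(sequence(expectedCols),
           function(x) {
             sapply(inlist, function(y) y[[x]])
           }),
    col.names = paste0("V", sequence(expectedCols)),
    # Keep strings as character (this was not the default before R 4.0)
    stringsAsFactors = FALSE)
}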
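
If the number of columns is not known in advance, one possible variation derives it from the data. This is only a sketch, and it assumes every sub-list has the same length as the first one; `cols_from_rows` and `rows` are illustrative names, not part of the question.

```r
cols_from_rows <- function(inlist) {
  n <- length(inlist[[1]])               # assumption: all rows have this length
  stopifnot(all(lengths(inlist) == n))   # fail early on ragged input
  # One vector per column: element j of every row
  cols <- lapply(seq_len(n), function(j) {
    sapply(inlist, function(row) row[[j]])
  })
  as.data.frame(cols,
                col.names = paste0("V", seq_len(n)),
                stringsAsFactors = FALSE)
}

# Hypothetical two-row input with three fields per row
rows <- list(list(0L, "mu+", 1.5), list(1L, "nu_e", 2.5))
df1 <- cols_from_rows(rows)
```

The early `stopifnot` is a design choice: with ragged rows, `y[[x]]` would otherwise fail partway through with a less helpful subscript error.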

Option 2: Use do.call(rbind, .) and then unlist each column to make a regular data.frame with no list columns.

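myFun_2 <- function(inlist) {
  # rbind on a list of lists gives a list-matrix, so every column is a list
  x <- as.data.frame(do.call(rbind, inlist))
  # Flatten each list column into an ordinary atomic vector
  x[] <- lapply(x, unlist)
  x
}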
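
To see why the unlist step is needed, here is a minimal sketch on a hypothetical two-row list (`rows` and the other variable names are illustrative only). Binding lists row-wise produces a matrix of mode "list", so each column of the resulting data frame is itself a list until it is flattened.

```r
# Hypothetical two-row input, mimicking the structure in the question
rows <- list(
  list(0L, "mu+", 1.5),
  list(1L, "nu_e", 2.5)
)

m <- do.call(rbind, rows)   # a 2 x 3 matrix whose cells are list elements
df <- as.data.frame(m)      # columns V1..V3 are list columns at this point
col_is_list_before <- vapply(df, is.list, logical(1))

df[] <- lapply(df, unlist)  # flatten each column to an atomic vector
col_classes_after <- vapply(df, class, character(1))
```

After the last step each column should have its natural type (integer, character, numeric) rather than being a list.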

Let's test these out with some sample data. Here's a list that should create a rectangular 3 row x 14 column dataset:

LL <- list(
  list(0L, 0L, 'mu+', 1, 0, 0, 1, 3231.6421853545253, -17.361063509909364,
       6322.884067996471, -2751.857298366544, 1.2318766603937736, 
       1407.9560948453036, 3092.931322317615),
  list(0L, 0L, 'nu_e', 3, 1, 0, 0, 3231.6421853545253, -17.361063509909364,
       6322.884067996471, -743.6755000649275, 9.950229845741603, 
       342.4203222294634, 818.781981693865),
  list(0L, 0L, 'anti_nu_mu', 2, 1, 0, 0, 3231.6421853545253, 
       -17.361063509909364, 6322.884067996471, -808.1114666690765, 
       21.680955968349267, 445.2784282520303, 922.9231198102832))

Here's a bigger version of this, which would create a 150000 row by 14 column dataset.

Big_LL <- unlist(replicate(50000, LL, FALSE), FALSE)

Outcomes of each function on the small dataset:

myFun_1(LL)
##   V1 V2         V3 V4 V5 V6 V7       V8        V9      V10        V11       V12
## 1  0  0        mu+  1  0  0  1 3231.642 -17.36106 6322.884 -2751.8573  1.231877
## 2  0  0       nu_e  3  1  0  0 3231.642 -17.36106 6322.884  -743.6755  9.950230
## 3  0  0 anti_nu_mu  2  1  0  0 3231.642 -17.36106 6322.884  -808.1115 21.680956
##         V13       V14
## 1 1407.9561 3092.9313
## 2  342.4203  818.7820
## 3  445.2784  922.9231

myFun_2(LL)
##   V1 V2         V3 V4 V5 V6 V7       V8        V9      V10        V11       V12
## 1  0  0        mu+  1  0  0  1 3231.642 -17.36106 6322.884 -2751.8573  1.231877
## 2  0  0       nu_e  3  1  0  0 3231.642 -17.36106 6322.884  -743.6755  9.950230
## 3  0  0 anti_nu_mu  2  1  0  0 3231.642 -17.36106 6322.884  -808.1115 21.680956
##         V13       V14
## 1 1407.9561 3092.9313
## 2  342.4203  818.7820
## 3  445.2784  922.9231

All looking good. Now, how about performance?

system.time(myFun_1(Big_LL))
##    user  system elapsed 
##    2.65    0.05    2.75 

system.time(myFun_2(Big_LL))
##    user  system elapsed 
##    0.41    0.00    0.40

So, go with the second approach ;-)
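
Since the question asks for a tibble with proper column names, a possible follow-up step is sketched below. The column names are made up for illustration (the question does not say what the 14 fields are), `rows_to_df` is a hypothetical helper built on the second approach, and the final conversion assumes the tibble package is installed.

```r
# Same idea as the second approach, with caller-supplied column names
rows_to_df <- function(inlist, col_names) {
  x <- as.data.frame(do.call(rbind, inlist))
  x[] <- lapply(x, unlist)   # flatten list columns to atomic vectors
  names(x) <- col_names
  x
}

# Hypothetical two-row input with three fields per row
rows <- list(list(0L, "mu+", 3231.64), list(0L, "nu_e", 818.78))
df <- rows_to_df(rows, c("run", "particle", "energy"))  # hypothetical names

# Convert to a tibble only if the package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
}
```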

Thank you so much for this answer. I was banging my head off a wall trying to do this with str_split output. Still have to see how it works, as I'm more au fait with dplyr et al. but it does. — astaines, Apr 22 '23 at 20:53
