I'm using the reticulate R package from RStudio to run some Python code that brings data from ROOT (http://root.cern.ch) into R. My problem is that the Python code returns a list of row-wise lists. For example, in Python,

[[0L, 0L, 'mu+', 1, 0, 0, 1, 3231.6421853545253, -17.361063509909364, 6322.884067996471, -2751.857298366544, 1.2318766603937736, 1407.9560948453036, 3092.931322317615], 
[0L, 0L, 'nu_e', 3, 1, 0, 0, 3231.6421853545253, -17.361063509909364, 6322.884067996471, -743.6755000649275, 9.950229845741603, 342.4203222294634, 818.781981693865], 
[0L, 0L, 'anti_nu_mu', 2, 1, 0, 0, 3231.6421853545253, -17.361063509909364, 6322.884067996471, -808.1114666690765, 21.680955968349267, 445.2784282520303, 922.9231198102832],
...]

These data get turned into a corresponding list of lists in R by reticulate,

List of 136972
$ :List of 14
..$ : int 0
..$ : int 0
..$ : chr "mu+"
..$ : int 1
..$ : int 0
..$ : int 0
..$ : int 0
..$ : num 7162
..$ : num -0.0108
..$ : num -627
..$ : num 264
..$ : num -3.24
..$ : num 3080
..$ : num 3093
$ :List of 14
..$ : int 0
..$ : int 0
..$ : chr "mu+"
..$ : int 1
.... (you get the idea)

I've searched everywhere I can think of, and I cannot find a way to turn these data into a data frame (I really want a tibble). One problem seems to be that the list entries are not named. There's a lot of data, so I don't want to do something inefficient. I could have the Python code return a dictionary of columns instead, and that would work, but the Python code to make a row is so much simpler.

If there were an easy way to turn these row-wise lists into a data frame, that would be ideal. Any ideas?

zx8754
Adam
  • have you tried `bind_rows`? – Pierre L Mar 07 '17 at 07:21
  • Maybe something like `as.data.frame(lapply(1:14, function(x) sapply(LL, function(y) y[[x]])), col.names = paste0("V", 1:14))` (where `LL` is your `list` of 136972 values). – A5C1D2H2I1M1N2O1R2T1 Mar 07 '17 at 08:47
  • Thanks! That works and isn't too slow. I had tried `df <- as.data.frame(do.call(rbind, myList), col.names=colnames)`, but then I end up with a data frame where the columns are lists. Is there a way to make that work? – Adam Mar 07 '17 at 14:12

1 Answer

Here are a couple of approaches that came to mind:

  • Option 1: We know how many items are in the sub-lists (how many columns are expected). Cycle through the list to make a new list with each relevant element from the sub-lists. Wrap that in as.data.frame and you're done.

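    myFun_1 <- function(inlist, expectedCols = 14) {
      as.data.frame(
        # For each column position, pull that element out of every row
        lapply(sequence(expectedCols), 
               function(x) {
                 sapply(inlist, function(y) y[[x]])
               }),
        # Name the columns V1, V2, ... to match as.data.frame's defaults
        col.names = paste0("V", sequence(expectedCols)))
    }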
    
  • Option 2: Use do.call(rbind, .) and then unlist each column to make a regular data.frame with no list columns.

    myFun_2 <- function(inlist) {
      x <- as.data.frame(do.call(rbind, inlist))
      x[] <- lapply(x, unlist)
      x
    }
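
To see why the unlist step in Option 2 is needed, here is a minimal sketch that inspects the intermediate objects. The two-row list `rows` is a hypothetical stand-in, not the asker's data.

```r
# Hypothetical two-row input standing in for the reticulate result
rows <- list(
  list(0L, "mu+", 3231.64),
  list(0L, "nu_e", 818.78)
)

# rbind() on lists gives a 2 x 3 matrix whose cells are list elements
m <- do.call(rbind, rows)
is.list(m)
dim(m)

# as.data.frame() keeps every column as a list
x <- as.data.frame(m)
list_cols <- sapply(x, is.list)

# unlist() collapses each list column to an atomic vector;
# assigning into x[] keeps the data.frame structure and names
x[] <- lapply(x, unlist)
col_classes <- sapply(x, class)
col_classes
```

One caveat, stated as an assumption rather than something tested here: if the Python side returns None for any field, reticulate converts it to NULL, and unlist drops NULL entries, so that column would come out shorter than the others.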
    

Let's test these out with some sample data. Here's a list that should create a rectangular 3 row x 14 column dataset:

LL <- list(
  list(0L, 0L, 'mu+', 1, 0, 0, 1, 3231.6421853545253, -17.361063509909364,
       6322.884067996471, -2751.857298366544, 1.2318766603937736, 
       1407.9560948453036, 3092.931322317615),
  list(0L, 0L, 'nu_e', 3, 1, 0, 0, 3231.6421853545253, -17.361063509909364,
       6322.884067996471, -743.6755000649275, 9.950229845741603, 
       342.4203222294634, 818.781981693865),
  list(0L, 0L, 'anti_nu_mu', 2, 1, 0, 0, 3231.6421853545253, 
       -17.361063509909364, 6322.884067996471, -808.1114666690765, 
       21.680955968349267, 445.2784282520303, 922.9231198102832))

Here's a bigger version of this, which would create a 150000 row by 14 column dataset.

Big_LL <- unlist(replicate(50000, LL, FALSE), FALSE)
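
The Big_LL line does two things: replicate with simplify = FALSE makes a list of copies of the whole list, and unlist with recursive = FALSE flattens exactly one level so the result is again a flat list of rows. A small sketch with a hypothetical two-row list `rows` shows the shapes involved:

```r
# Hypothetical two-row list; each row is itself a list
rows <- list(list(1L, "a"), list(2L, "b"))

# Three copies of the whole list, kept as a list of lists of rows
nested <- replicate(3, rows, simplify = FALSE)
length(nested)

# Flatten one level only, so the rows stay intact as lists
flat <- unlist(nested, recursive = FALSE)
length(flat)

# The third element is the first row of the second copy
flat[[3]]
```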

Outcomes of each function on the small dataset:

myFun_1(LL)
##   V1 V2         V3 V4 V5 V6 V7       V8        V9      V10        V11       V12
## 1  0  0        mu+  1  0  0  1 3231.642 -17.36106 6322.884 -2751.8573  1.231877
## 2  0  0       nu_e  3  1  0  0 3231.642 -17.36106 6322.884  -743.6755  9.950230
## 3  0  0 anti_nu_mu  2  1  0  0 3231.642 -17.36106 6322.884  -808.1115 21.680956
##         V13       V14
## 1 1407.9561 3092.9313
## 2  342.4203  818.7820
## 3  445.2784  922.9231

myFun_2(LL)
##   V1 V2         V3 V4 V5 V6 V7       V8        V9      V10        V11       V12
## 1  0  0        mu+  1  0  0  1 3231.642 -17.36106 6322.884 -2751.8573  1.231877
## 2  0  0       nu_e  3  1  0  0 3231.642 -17.36106 6322.884  -743.6755  9.950230
## 3  0  0 anti_nu_mu  2  1  0  0 3231.642 -17.36106 6322.884  -808.1115 21.680956
##         V13       V14
## 1 1407.9561 3092.9313
## 2  342.4203  818.7820
## 3  445.2784  922.9231

All looking good. Now, how about performance?

system.time(myFun_1(Big_LL))
##    user  system elapsed 
##    2.65    0.05    2.75 

system.time(myFun_2(Big_LL))
##    user  system elapsed 
##    0.41    0.00    0.40 

So, go with the second approach ;-)
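
Since the question asks for a tibble with named columns, here is a minimal sketch of one way to finish the job after the second approach. The column names in `col_names` and the small list `rows` are hypothetical placeholders, and the tibble conversion only runs if the tibble package happens to be installed.

```r
# Hypothetical column names; replace with the real field names
col_names <- c("run", "particle", "energy")

# Hypothetical two-row input standing in for the reticulate result
rows <- list(list(0L, "mu+", 3231.64), list(0L, "nu_e", 818.78))

# Same steps as the second approach: bind rows, then unlist each column
df <- as.data.frame(do.call(rbind, rows))
df[] <- lapply(df, unlist)

# Replace the default V1, V2, ... names
names(df) <- col_names

# Convert to a tibble only if the tibble package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  df <- tibble::as_tibble(df)
}
str(df)
```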

A5C1D2H2I1M1N2O1R2T1
  • Thank you so much for this answer. I was banging my head off a wall trying to do this with str_split output. Still have to see how it works, as I'm more au fait with dplyr et al. but it does. – astaines Apr 22 '23 at 20:53