2

It's been some time since I've worked in R, so I'm a little rusty and need some help with lists. I have a list which contains 7 elements that pertain to patrons' purchases while visiting a large chain store (a snippet of the list is provided below). Within a given index, elements 1, 2, and 7 are always vectors of length 1 and elements 4, 5, and 6 are always vectors of the same length, but vary from list-index to list index (e.g. from [[74]] to [[75]]). This is because elements 1, 2, and 7 are about a patron's visit to a store, whereas elements 4, 5, and 6 are about their individual purchases during that visit, so there is a one to many relationship between elements 1, 2, 7 (visit) and 3, 4, and 5 (purchases). I am trying to figure out how to efficiently convert my list to a single data frame. The catch is that, when I create the one large data frame, that the visit information be recycled across the purchases within a list index. So for example, in the example data provided below, I'd like to have the order number, date, payment method ("Visa"), be recycled across each of the purchases in the data frame so it looks like this in the end:

73 "Order #: 065-PO-4080219" "Sunday, September 17 2017" "PowerColor Red Dragon Radeon RX 580 Dual-Fan 8GB GDDR5 PCIe Video Card" $329.99 1 "Visa"
73 "Order #: 065-PO-4080219" "Sunday, September 17 2017" "PowerColor Red Dragon Radeon RX 580 Dual-Fan 8GB GDDR5 PCIe Video Card" $329.99 1 "Visa"
73 "Order #: 065-PO-4080219" "Sunday, September 17 2017" "ASUS PRIME Z270-AR LGA 1151 ATX Intel Motherboard"                      $159.99 1 "Visa"
74 "Order #: 065-PO-4079152" "Saturday, September 16 2017" "Olympia Tools Tool Set 53 Piece"                                       $12.99 1 "Visa"
74 "Order #: 065-PO-4079152" "Saturday, September 16 2017" "The Best Connection Cable Ties"                                         $1.99 1 "Visa"

I have been able to accomplish this by first converting each list index into a dataframe and then using:

do.call("rbind", MyListOfDataFrames)

But this seems really inefficient to me (as I understand it, it's generally inefficient to first convert to dataframes and then combine them into a larger data frame).

Is there a way to convert this list into one large dataframe? I have thousands and thousands of these records to process, so I want to make this as efficient as possible. I have placed a small subset of the data on a publicly available site here to download if it helps (this can be loaded with the load() function and the resulting list is called ProductList and I have placed an ASCII representation of the List here using dput() as suggested by a user and I've also placed the raw dput() output at the end of this post). I tried searching the posts on stackoverflow but none really seemed to address this. Thanks for your help.

List snippet

[[73]][[1]]
[1] 73

[73]][[2]]
[1] "Order #: 065-PO-4080219"

[[73]][[3]]
[1] "Sunday, September 17 2017"

[[73]][[4]]
[1] "PowerColor Red Dragon Radeon RX 580 Dual-Fan 8GB GDDR5 PCIe Video Card" "PowerColor Red Dragon Radeon RX 580 Dual-Fan 8GB GDDR5 PCIe Video Card"
[3] "ASUS PRIME Z270-AR LGA 1151 ATX Intel Motherboard"                     

[[73]][[5]]
[1] "$329.99" "$329.99" "$159.99"

[[73]][[6]]
[1] "1" "1" "1"

[[73]][[7]]
[1] "Visa"


[[74]]
[[74]][[1]]
[1] 74

[[74]][[2]]
[1] "Order #: 065-PO-4079152"

[[74]][[3]]
[1] "Saturday, September 16 2017"

[[74]][[4]]
[1] "Olympia Tools Tool Set 53 Piece" "The Best Connection Cable Ties"                     

[[74]][[5]]
[1] "$12.99"  "$1.99"   

[[74]][[6]]
[1] "1" "1" 

[[74]][[7]]
[1] "Visa"

Here's a dput() version of the data directly in the post as suggested as well:

list(list(1, "Order #: 065-PO-4166764", "Friday, December 22 2017", 
    c("Belkin 12 Outlet Home Theater Surge Protector 3996 Joules with Phone/Fax/Coax Protection & 8 ft. Cord - Black", 
    "Match Competitor"), c("$25.90", "$0.00"), c("5", "1"), "Visa"), 
    list(2, "Order #: 065-PO-4067551", "Saturday, September 2 2017", 
        c("MSI Gaming X Radeon RX-580 Dual-Fan 8GB GDDR5 PCIe Video Card", 
        "QVS HDMI Female to DVI-D Male Video Adapter - Black", 
        "MSI Radeon RX 580 GAMING X 4GB GDDR5 Video Card"), c("$329.99", 
        "$9.99", "$269.99"), c("1", "1", "1"), "Master Card"), 
    list(3, "Order #: 041-PO-8823995", "Sunday, August 27 2017", 
        "MSI Armor Radeon RX-470 Overclocked Dual-Fan 8GB GDDR5 PCIe Video Card", 
        "$279.99", "1", "Master Card"))

Update

This post has been marked as a duplicate of other posts by moderators (for example, this one and this one), but it is not. There are other posts that have described how to collapse a list of dataframes into a single data frame, but this is not what I have. I have a list of lists that need to be collapased. Now, surely, there are posts that describe collapsing lists of list, but none that I have found are posts that describe a situation where there is a one-to many relationship present among the items in the sublists where recycling is desired. In the first example post, for example, each sublist contains only a single element, whereas the list I'm working with has several elements of unequal lengths and my shorter vectors need to be recycled within each list. The second example post doesn't address my situation either as the OP in that situation had a list of data frames. If I had a list of dataframes, my problem would be simply addressed with an inefficient do.call statement. To reiterate, my problem is one in which I have a list of lists of uneven length and each sublist (not in a dataframe), must be collapasedn with shorter elements within each recycled among the longer elementsn to form a dataframe in an efficient manner. I hope this clarifies.

StatsStudent
  • 1,384
  • 2
  • 10
  • 28
  • Please use `dput` to provide sample data. – Maurits Evers Sep 24 '18 at 03:59
  • @MauritsEvers, I've updated the post to make an ASCII representation of the sample list available for download. Thanks for the good suggestion. – StatsStudent Sep 24 '18 at 04:06
  • You should *always* include sample data in a reproducible and copy&paste-able format in your post. Many users are loath to download files through file hosters (myself included). An ASCII representation of your list is not useful since we can't copy&paste. So please edit your post, and include the `dput` output in your post directly. – Maurits Evers Sep 24 '18 at 04:08
  • 2
    @MauritsEvers, good point. I'm always scared to trust download links too. I've updated my post with the dput output. The post you have provided is not a duplicate because that shows a list of dataframes. My list is a list of lists with elements of unequal size which must be recycled across the longer elements. – StatsStudent Sep 24 '18 at 04:23
  • 2
    Ok, thanks for clarifying and making your example reproducible. – Maurits Evers Sep 24 '18 at 04:24
  • @Jaap, I have updated the post to describe why this is not a duplicate. Some of this information was also previously reported in the comments. Thank you for re-opening. – StatsStudent Sep 24 '18 at 18:58
  • The first link is about how to collapse a list of lists. It is thus a correct duplicate imo. – Jaap Sep 24 '18 at 19:13
  • The links seem to fit..? `library(data.table); rbindlist(lapply(L, as.data.table))` works, which seems to be roughly what's shown (with as.data.table to handle the recycling issue) – Frank Sep 24 '18 at 19:33
  • Fwiw, I would recommend making two tables `oDT = rbindlist(lapply(L, \`[\`, -(4:6))); oitemDT = rbindlist(lapply(lapply(L, \`[\`, c(1, 4:6)), as.data.table))` (linkable through the first column if necessary) – Frank Sep 24 '18 at 19:42
  • @Jaap. It is not a true duplicate. Collapsing a list of lists is only part of the question. To be sure none of the links provided fully addresses the question with a solution. Can we vote to re-open, please? I'd be interested in additional potential answers. Thanks. – StatsStudent Sep 25 '18 at 02:15

2 Answers2

1

you can do the following:

 do.call(rbind.data.frame,lapply(lst,function(x)as.matrix(unname(do.call(data.frame,rbind(x))))))
  V1                      V2                         V3
1  1 Order #: 065-PO-4166764   Friday, December 22 2017
2  1 Order #: 065-PO-4166764   Friday, December 22 2017
3  2 Order #: 065-PO-4067551 Saturday, September 2 2017
4  2 Order #: 065-PO-4067551 Saturday, September 2 2017
5  2 Order #: 065-PO-4067551 Saturday, September 2 2017
6  3 Order #: 041-PO-8823995     Sunday, August 27 2017
                                                                                                             V4      V5 V6
1 Belkin 12 Outlet Home Theater Surge Protector 3996 Joules with Phone/Fax/Coax Protection & 8 ft. Cord - Black  $25.90  5
2                                                                                              Match Competitor   $0.00  1
3                                                 MSI Gaming X Radeon RX-580 Dual-Fan 8GB GDDR5 PCIe Video Card $329.99  1
4                                                           QVS HDMI Female to DVI-D Male Video Adapter - Black   $9.99  1
5                                                               MSI Radeon RX 580 GAMING X 4GB GDDR5 Video Card $269.99  1
6                                        MSI Armor Radeon RX-470 Overclocked Dual-Fan 8GB GDDR5 PCIe Video Card $279.99  1
           V7
1        Visa
2        Visa
3 Master Card
4 Master Card
5 Master Card
6 Master Card
Onyambu
  • 67,392
  • 3
  • 24
  • 53
0

So, I think I discovered a few possible answers by combining a number of solutions I found on stackoverflow and elsewhere on the "interwebs." In fact, I cobbled together the four solutions below:

docall_DF<-do.call("rbind", lapply(ProductList, data.frame))

library(data.table)
rbindlist_DF<-as.data.frame(data.table::rbindlist(lapply(ProductList, data.frame)))

library(plyr)
plyr_rbind_DF<-plyr::rbind.fill(lapply(ProductList, data.frame))
plyr_ldply_DF = plyr::ldply(lapply(ProductList, data.frame), data.frame)

Since efficiency was my goal given the amount of data with which I was working, I performed a benchmark, and it appears that the rbindlist method in the data.table library is the best performer that does what I was hoping for:

library(benchmark)
benchmark(
docall_DF<-do.call("rbind", lapply(ProductList, data.frame)),
rbindlist_DF<-as.data.frame(data.table::rbindlist(lapply(ProductList, data.frame))),
plyr_rbind_DF<-plyr::rbind.fill(lapply(ProductList, data.frame)),
plyr_ldply_DF = plyr::ldply(lapply(ProductList, data.frame), data.frame),
     replications = 100, order = "relative", 
     columns=c('test','replications', 'elapsed','relative')
)

The results of the benchmark are below and shows that rbindlist is the fastest:

                                                                                 test replications elapsed relative
2 rbindlist_DF <- as.data.frame(data.table::rbindlist(lapply(ProductList, data.frame)))          100    9.03    1.000
1                        docall_DF <- do.call("rbind", lapply(ProductList, data.frame))          100   10.44    1.156
3                    plyr_rbind_DF <- plyr::rbind.fill(lapply(ProductList, data.frame))          100   10.44    1.156
4                                                                         plyr_ldply_DF          100   11.70    1.296

That being said, I'll leave the post open for a while in hopes someone else might provide some additional/better answers.

StatsStudent
  • 1,384
  • 2
  • 10
  • 28