
I have several lists. For example:

b1 <-
  list(
    duck = list(
      day = "Monday",
      location = list("Cisco Park", "Elephant Park")
    ),
    eagle = list(day = "Saturday"),
    pigeon = list(location = "backyard")
  )

b2 <- list(
  duck = list(day = "Tuesday", location = "Valley Green"),
  goose = list(location = "Old man Johnson's Farm")
)

I would like to merge them in a way that aggregates the values within each named element of those lists. The merge only needs to be recursive to the extent that day or location could be a vector in the original lists; that's as deep as it goes.

desired <-
  list(
    duck = list(
      day = list("Monday", "Tuesday"),
      location = list("Cisco Park", "Elephant Park", "Valley Green")
    ),
    eagle = list(day = "Saturday"),
    pigeon = list(location = "backyard"),
    goose = list(location = "Old man Johnson's Farm")
  )

I wrote an lapply() solution that works but is long and very slow. Next I tried the approach from Combine/merge lists by elements names:

l <- list(b1, b2)
keys <- unique(unlist(lapply(l, names)))
merged <- setNames(do.call(mapply, c(FUN = c, lapply(l, `[`, keys))), keys)

dput(merged)

That's fast and merges the two lists, but it creates multiple elements with the same name:

list(duck = list(day = "Monday", location = list("Cisco Park", 
    "Elephant Park"), day = "Tuesday", location = "Valley Green"), 
    eagle = list(day = "Saturday"), pigeon = list(location = "backyard"), 
    goose = list(location = "Old man Johnson's Farm"))
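
If I could cheaply regroup each animal's elements by name afterwards, that might salvage this approach. A rough sketch of what I mean (untested at scale; note it flattens the values to vectors rather than one-element lists, which would be fine for me since the leaves are only ever scalars or flat collections of scalars):

merged2 <- lapply(merged, function(x) {
  # regroup the duplicate-named elements: one vector per unique name
  split(unlist(x, use.names = FALSE), rep(names(x), lengths(x)))
})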
Mark Miller

3 Answers


I have no idea how something like this would be done efficiently, but here is an option:

out <- c(b1, b2[setdiff(names(b2), names(b1))]) # start from b1 plus the animals only in b2
for (k in intersect(names(b2), names(b1))) {    # animals present in both lists
  elements <- names(b2[[k]])
  for (v in elements) {
    if (v %in% names(b1[[k]])) {
      # field exists in both: combine and de-duplicate
      # (single-bracket b2[[k]][v] keeps list structure)
      out[[k]][[v]] <- unique(c(b1[[k]][[v]], b2[[k]][v]))
    } else {
      # field only in b2: copy it over
      out[[k]][[v]] <- b2[[k]][v]
    }
  }
}
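
A quick sanity check against the desired structure from the question (note that unique() drops names from a list, which is why the stray day/location names introduced by the single-bracket subsetting disappear):

identical(out, desired)
# should be TRUE on the example data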

Maybe a list is not the optimal data structure for the task at hand? I would try data.table with list columns. Here is a rough example:

library(data.table)

bt <- c(b1, b2) # combine the two input lists
bDT <- rbindlist(lapply(bt, function(x) lapply(x, identity)), fill = TRUE) # one chunk of rows per animal entry
bDT[, animal := rep(names(bt), sapply(bt, function(x) max(lengths(x))))]   # label rows with the animal name
bDT <- bDT[, .(day = list(unique(day)), location = list(unique(location))), by = animal]

bDT[animal == "duck", location]
# [[1]]
# [[1]][[1]]
# [1] "Cisco Park"
# 
# [[1]][[2]]
# [1] "Elephant Park"
# 
# [[1]][[3]]
# [1] "Valley Green"
s_baldur
  • Thanks. One of my lists has ~200,000 items; another has ~50,000. I have been testing with increasingly larger subsets of those lists. From 1,000 to 90,000 items, the task completes in less than a second. At 100,000 items, it doesn't complete after several minutes. I have a 2019 MacBook Pro with 32 GB RAM, and it's not busy with anything else. – Mark Miller May 22 '20 at 15:03
  • @MarkMiller Updated the answer. – s_baldur May 22 '20 at 15:28
  • Thanks for brainstorming. I'm going to keep trying with lists for at least another day or two. I'll try this at some point. – Mark Miller May 22 '20 at 15:32

What an interesting question!

Does this approach with nested lapply() calls get what you want? I am not sure how efficient it is - I guess it shouldn't be too bad (leaving the NULL values in, rather than filtering them out, should improve performance).

l <- list(b1, b2)
l1.keys <- unique(unlist(lapply(l, names)))                  # animal names across both lists
l2.keys <- unique(unlist(lapply(lapply(l, "[[", 1), names))) # field names, read from the first animal of each list

l1 <- lapply(l1.keys, function(l1.key){
    l2 <- lapply(l2.keys, function(l2.key){
        as.list(do.call("c", lapply(lapply(l, "[[", l1.key), "[[", l2.key))) # This [as.list(c())]...
    })
    names(l2) <- l2.keys
    l2[lengths(l2) != 0] # ...and this prevents NULLs in the output.
})
names(l1) <- l1.keys
l1

Output:

$duck
$duck$day
$duck$day[[1]]
[1] "Monday"

$duck$day[[2]]
[1] "Tuesday"


$duck$location
$duck$location[[1]]
[1] "Cisco Park"

$duck$location[[2]]
[1] "Elephant Park"

$duck$location[[3]]
[1] "Valley Green"



$eagle
$eagle$day
$eagle$day[[1]]
[1] "Saturday"



$pigeon
$pigeon$location
$pigeon$location[[1]]
[1] "backyard"



$goose
$goose$location
$goose$location[[1]]
[1] "Old man Johnson's Farm"
randr

This is basically a recursive merge of lists. I'm not sure how fast this is on the full data set, but it should also work at arbitrary depth.

library(purrr)
library(magrittr)

mergeMyLists <- function(x, y){

  # determine names missing from x
  newNames <- setdiff(names(y), names(x))

  # create new list with all names and data from x
  z <- x
  if(length(newNames) > 0) {
    z[seq(length(x)+1, length(x) + length(newNames)) ] <- list(NULL)
    names(z) <- c(names(x), newNames)
  }

  # if the member is a list, join recursively;
  # otherwise combine, and make sure the result is a list (the input is sometimes a vector)
  map(names(z),
      ~{
        if(is.list(z[[..1]]) && is.list(y[[..1]])) {
          mergeMyLists(z[[..1]], y[[..1]]) 
        } else {
          tmp <- c(z[[..1]], y[[..1]])
          if(length(tmp) > 1 && !is.list(tmp)) as.list(tmp) else tmp
        }
      }
  ) %>% 
    set_names(names(z))
}

mergedList <- mergeMyLists(b1, b2)

You may be able to speed it up by making your input data more consistent (that way the function won't have to do so many checks). For example, b1$duck$location is a list while b2$duck$location is a character vector. If the merging function knows both of these will always be lists, it won't have to check and potentially cast the output to a list to achieve your desired structure; a sketch of such a normalization pass follows.
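A minimal sketch of that normalization, assuming the fixed two-level animal/field shape from the question (normalize() is a name I made up for illustration; mergeMyLists itself would still need adjusting so its is.list() test doesn't descend into the now-uniform leaf lists):

normalize <- function(b) {
  # wrap every leaf in a list so the merge code never sees a bare vector
  lapply(b, function(animal) lapply(animal, as.list))
}
b1n <- normalize(b1)
b2n <- normalize(b2)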

Marcus
  • Thanks. As-is, it's pretty slow on my ~200k-item lists. I'll look into making the items uniform. – Mark Miller May 22 '20 at 17:55
  • Some thoughts on optimization (which probably could be a separate post): do you have a sense of whether most of your entries require merging (e.g. duck) or just copying from one set (e.g. goose)? This could give you an idea of which sub-task to focus on. Right now the method in this answer treats everything as needing to be merged. Also, it arbitrarily chooses the first list as the base (`z <- x`). I'm not certain, but there may be a gain from selecting the reference list based on how many new items need to be added. – Marcus May 28 '20 at 12:04
  • Thanks, those are good design considerations. I switched to aggregating my data into a data frame, with a single row per "animal" and pipe-delimited values in each column where necessary. I wrote a small function to build the list, and it only takes a minute or two. But it feels inelegant. Now I just have to find a good way to recognize the feedback I've received! – Mark Miller May 28 '20 at 12:54