1

I'm reading a YAML file like

- person_id: 111
  person_name: Russell
  time:
  - 1
  - 2
  - 3
  value:
  - a
  - b
  - c
- person_id: 222
  person_name: Steven
  time:
  - 1
  - 2
  value:
  - d
  - e

that I want to denormalize to:

  person_id person_name time value
1       111     Russell    1     a
2       111     Russell    2     b
3       111     Russell    3     c
4       222      Steven    1     d
5       222      Steven    2     e

I have a solution, but I was hoping there is something more concise. Here's the nested list:

l <- list(
  list( 
    person_id   = 111L,
    person_name = "Russell", 
    time        = 1:3, 
    value       = letters[1:3]
  ),
  list( 
    person_id   = 222L,
    person_name = "Steven", 
    time        = 1:2, 
    value       = letters[4:5]
  )
)   

Regarding possible duplicates, this question is similar to (1) How to denormalize nested list in R?, but the structure is different (the round/diff/saldo structure is transposed compared to time/value here), and to (2) Split comma-separated column into separate rows, but time is a vector instead of a comma-separated element like director. I'm hoping this different structure helps.

wibeasley
  • Here's a simple base R one-liner: `do.call(rbind, lapply(l, data.frame))`. – lmo Nov 11 '17 at 21:00
  • @lmo, that's awesome. I like how `lapply()` does the work of replicating the parent variables of `person_id` and `person_name`. If you post this as a response, I'd love to vote on it. – wibeasley Nov 11 '17 at 21:04

5 Answers

1
Reduce(rbind, lapply(l, data.frame))
submartingale
1

To complement the ideas/approaches by @lmo and @submartingale, here's a purrr/tidyverse version that converts each nested list into a data.frame/tibble (by replicating the parent elements of name & id), then stacks them into a single tibble.

l %>% 
  purrr::map_df(tibble::as_tibble)

Thanks guys for proposing something so concise and generalizable.
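For readers on purrr 1.0.0 or later, where `map_df()` is marked as superseded, here is a hedged sketch of the same idea using `purrr::map()` followed by `purrr::list_rbind()`. It assumes purrr (>= 1.0.0) and tibble are installed and R >= 4.1 for the native pipe; the list `l` is rebuilt from the question, and the name `ds` is only for illustration.

```r
# Rebuild the nested list from the question.
l <- list(
  list(person_id = 111L, person_name = "Russell", time = 1:3, value = letters[1:3]),
  list(person_id = 222L, person_name = "Steven",  time = 1:2, value = letters[4:5])
)

# Convert each person to a tibble (length-one elements are recycled),
# then row-bind the resulting list of tibbles.
ds <- l |>
  purrr::map(tibble::as_tibble) |>
  purrr::list_rbind()

ds
```

The result should have the same five rows and four columns as the `map_df()` version above.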

wibeasley
1

A simple base R method is to use lapply and data.frame to return a list of data.frames and then use do.call with rbind to combine the data.frames into a single data.frame object.

do.call(rbind, lapply(l, data.frame))

which returns

  person_id person_name time value
1       111     Russell    1     a
2       111     Russell    2     b
3       111     Russell    3     c
4       222      Steven    1     d
5       222      Steven    2     e
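To make the intermediate step visible, here is a small base R sketch that rebuilds the question's `l` and splits the one-liner in two. The names `frames` and `ds` are only for illustration. It shows that `data.frame()` recycles the length-one `person_id` and `person_name` to match the longer `time` and `value` vectors, and `rbind` then stacks the pieces.

```r
# Rebuild the nested list from the question.
l <- list(
  list(person_id = 111L, person_name = "Russell", time = 1:3, value = letters[1:3]),
  list(person_id = 222L, person_name = "Steven",  time = 1:2, value = letters[4:5])
)

# One data.frame per person; length-one elements are recycled
# to the length of the longest element.
frames <- lapply(l, data.frame)
sapply(frames, nrow)  # number of rows contributed by each person

# Stack the per-person data.frames into a single data.frame.
ds <- do.call(rbind, frames)
ds
```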

Note that in R versions before 4.0.0, person_name and value will be factor vectors, which can be annoying to work with. If desired, you can keep them as character vectors by setting the stringsAsFactors argument (from R 4.0.0 onward, FALSE is already the default).

do.call(rbind, lapply(l, data.frame, stringsAsFactors=FALSE))

The printed output looks the same, but the underlying data types of these two variables have changed.
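One way to check the column types, sketched in base R with the question's `l` rebuilt and `ds` as an illustrative name, is to ask for the class of each column:

```r
# Rebuild the nested list from the question.
l <- list(
  list(person_id = 111L, person_name = "Russell", time = 1:3, value = letters[1:3]),
  list(person_id = 222L, person_name = "Steven",  time = 1:2, value = letters[4:5])
)

ds <- do.call(rbind, lapply(l, data.frame, stringsAsFactors = FALSE))

# Class of each column; with stringsAsFactors = FALSE,
# person_name and value are reported as "character".
sapply(ds, class)
```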

lmo
0

This works, but is less than ideal because (a) each vector in the new data.frame needs to be handled individually and (b) the type of each vector must be stated explicitly (e.g., purrr::map_chr vs purrr::map_int).

# Step 1: Determine how many times the 'parent' rows need to be replicated.
values_per_person <- l %>% 
  purrr::modify_depth(2, length) %>% 
  purrr::map_int("value")

# Step 2: Pull out the parent rows and replicate the elements to match `time`.
id_replicated <- l %>% 
  purrr::map_int("person_id") %>% 
  rep(times=values_per_person)    
name_replicated <- l %>%
  purrr::map_chr("person_name") %>% 
  rep(times=values_per_person)

# Step 3: Pull out the nested/child rows.
time <- l %>%
  purrr::modify_depth(1, "time") %>% 
  purrr::flatten_int()
value <- l %>%
  purrr::modify_depth(1, "value") %>% 
  purrr::flatten_chr()

# Step 4: Combine the vectors in a data frame.
data.frame(
  person_id   = id_replicated,
  person_name = name_replicated,
  time        = time,
  value       = value
)
wibeasley
0

(Four years later and I'm still using this once or twice a month.) The yaml package provides a map handler. In this case, each map/person is converted into a tibble. Then dplyr::bind_rows() stacks all the tibbles to create a longer, single tibble.

path_yaml |> # Replace this line with code below to see a working example.
  yaml::read_yaml(
    handlers = list(map = \(x) tibble::as_tibble(x))
  ) |> 
  dplyr::bind_rows()

Extra details: with this simple dataset, the handler isn't even required -- bind_rows() converts each piece automatically. But I'm skeptical that it will always know how to coerce each map before stacking. Plus this explicit handler better communicates the intent.

If you want to play with a reproducible example, replace the file path (i.e., the first line) with

string <- 
"- person_id: 111
  person_name: Russell
  time:
  - 1
  - 2
  - 3
  value:
  - a
  - b
  - c
- person_id: 222
  person_name: Steven
  time:
  - 1
  - 2
  value:
  - d
  - e
"

textConnection(string) |> 
  yaml::read_yaml(...
wibeasley