I have been stuck on this for hours.

I am simplifying an XML file of more than 15,000 lines that contains data on lung function tests. Each XML file contains multiple tests. Using xml2 and map I can get the data into a list whose length is the number of tests.

Here is an extract of the list for two tests inside a file:

[[1]]
[[1]][[1]]
    Name       UM    Value 
"MEF75%"    "L/s"   "6.82" 

[[1]][[2]]
     Name        UM     Value Predicted  PercPred    ZScore       LLN       ULN 
   "FEV1"       "L"    "3.83"    "4.16"      "92"   "-0.62"    "3.27"    "5.01" 


...

[[2]]
[[2]][[1]]
    Name       UM    Value 
"MEF75%"    "L/s"   "6.65" 

[[2]][[2]]
     Name        UM     Value Predicted  PercPred    ZScore       LLN       ULN 
   "FEV1"       "L"    "3.79"    "4.16"      "91"   "-0.69"    "3.27"    "5.01" 
....

I can convert this into a tibble easily with map_dfr or bind_rows, but what I can't figure out is how to add the list index ([[1]] or [[2]]) as a column in the tibble. If I use the .id argument, it simply numbers the rows sequentially and doesn't refer to the position in the list:

map(trials, ~xml_find_all(., "AdditionalData/Parameters/Parameter")) %>%
  map(., ~xml_attrs(.)) %>%
  bind_rows(., .id = "test")
# A tibble: 104 x 9
   test    Name      UM    Value Predicted PercPred ZScore LLN   ULN  
   <chr> <chr>     <chr> <chr> <chr>     <chr>    <chr>  <chr> <chr>
 1 1     MEF75%    L/s   6.82  NA        NA       NA     NA    NA   
 2 2     FEV1      L     3.83  4.16      92       -0.62  3.27  5.01 
 ...
 53 53    MEF75% L/s   6.65  NA        NA       NA     NA    NA 
 54 54    FEV1  L     3.79  4.16      91       -0.69  3.27  5.01 

What I am trying to get to is (difference in first column - "test"):

map(trials, ~xml_find_all(., "AdditionalData/Parameters/Parameter")) %>%
  map(., ~xml_attrs(.)) %>%
  bind_rows(., .id = "test")
# A tibble: 104 x 9
   test    Name      UM    Value Predicted PercPred ZScore LLN   ULN  
   <chr> <chr>     <chr> <chr> <chr>     <chr>    <chr>  <chr> <chr>
 1 1     MEF75%    L/s   6.82  NA        NA       NA     NA    NA   
 2 1     FEV1      L     3.83  4.16      92       -0.62  3.27  5.01 
 ...
 53 2    MEF75% L/s   6.65  NA        NA       NA     NA    NA 
 54 2    FEV1  L     3.79  4.16      91       -0.69  3.27  5.01 

Is this do-able with tidyverse? Should I try to work it out with a base-R loop?

Any help appreciated, thanks. -BF

Konrad Rudolph
Ben Fox
  • can you provide a sample of the raw xml data? (or share the complete xml file on a fileshare somewhere) – Wimpel Dec 01 '21 at 11:02
  • You may need to nest another map inside `map (., ~xml_attrs(.))....` as in `map(., ~map(.x, \(x) xml_attrs(x)....` – GuedesBF Dec 01 '21 at 11:08
  • @Wimpel thank you - here is a link to google drive: https://drive.google.com/file/d/1WZ_cPvhknGx7fz-fJey_wsJYQJ2kT6Q6/view?usp=sharing – Ben Fox Dec 01 '21 at 18:58

1 Answer

To build an ID column from list elements of varying length, we can repeat the index of each element as many times as that element has entries.

x <- list(
  list(
    c(Name = "a", UM = "L/s", Value = "1"),
    c(Name = "a", UM = "L", Value = "3.1", Predicted = "1")
  ),
  list(
    c(Name = "b", UM = "L", Value = "2"),
    c(Name = "b", UM = "L/s", Value = "4", Predicted = "1.1"),
    c(Name = "b", UM = "L/s", Value = "4", Predicted = "1.1", ZScore = "-.50")
  ),
  list(1)
)
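# number of entries in each top-level element
y <- sapply(x, length)
# repeat index i once per entry of element i, then flatten to a plain vector
unlist(Map(function(n, i) rep(i, n), y, seq_along(y)), use.names = FALSE)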
#> [1] 1 1 2 2 2 3

Or, using tidyverse functions:

imap(map_int(x, length), ~rep(.y, .x)) %>% flatten_int()
#> [1] 1 1 2 2 2 3

Then add that vector as the ID column. If every list element has the same number of entries (2 in the extract shown in the original post), simply use rep(1:length(x), each = 2), where the each argument is the number of entries per element.
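Since base R's rep() accepts a vector for its times argument, the same index vector can probably be built in a single call. Below is a minimal sketch of that idea together with the "add it as the ID column" step. It reuses a cut-down version of x (without the list(1) element, which has no names and would not bind into rows); the names ids and tbl are only illustrative, and it assumes bind_rows treats each named character vector as one row, as the output in the question suggests.

```r
library(dplyr)

x <- list(
  list(
    c(Name = "a", UM = "L/s", Value = "1"),
    c(Name = "a", UM = "L", Value = "3.1", Predicted = "1")
  ),
  list(
    c(Name = "b", UM = "L", Value = "2"),
    c(Name = "b", UM = "L/s", Value = "4", Predicted = "1.1"),
    c(Name = "b", UM = "L/s", Value = "4", Predicted = "1.1", ZScore = "-.50")
  )
)

# index i is repeated lengths(x)[i] times
ids <- rep(seq_along(x), times = lengths(x))
ids
#> [1] 1 1 2 2 2

# flatten one level so each named vector becomes a row, then attach the ids
tbl <- bind_rows(unlist(x, recursive = FALSE)) %>%
  mutate(test = ids, .before = 1)
```

Because ids is computed from the nested list before flattening, it stays aligned with the rows as long as bind_rows keeps the input order.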

It's not entirely clear to me whether the list you showed in the post holds named vectors or data.frames with 1 row. In any case, here is an alternative using set_names, since bind_rows can take a named list:

list(
  data.frame(x = 1, y = 2),
  data.frame(x = 10, y = 15)
) %>%
  set_names(1:2) %>%
  bind_rows(.id = "test") # "test" comes back as a character column
  # to make it numeric, continue the pipe with:
  # %>% mutate(test = as.numeric(test))

#>   test  x  y
#> 1    1  1  2
#> 2    2 10 15
Donald Seinen
  • Thanks very much @Donald. I solved my issue with the tidyverse code: I decided to stop trying to be a smart-ass and just made an intermediate temporary variable to store the converted xml as a list of lists, and then `mutate`d your code into a new variable. Next time I'll try not to be so clever, it takes me too much time. – Ben Fox Dec 02 '21 at 12:22
  • @BenFox when I run the code from your post on the sample.xml from Google Drive, I get an error. However, according to `?bind_rows`, it can take a named list. So if you have to run this code multiple times, it might be worth looking at `set_names` (use it after you obtain a list of data.frames); then you can add a `group_by(test) %>% mutate(ID = group_indices())`. – Donald Seinen Dec 02 '21 at 13:10