0

After parsing xml files I have data looking like this:

example_df <-  
  tibble(id = "ABC",
         wage_type = "salary",
         name = c("Description","Code","Base",
                  "Description","Code","Base",
                  "Description","Code"),
         value = c("wage_element_1","51B","600",
                   "wage_element_2","51C","740",
                   "wage_element_3","51D"))

example_df 

# A tibble: 8 x 4
  id    wage_type name        value         
  <chr> <chr>     <chr>       <chr>         
1 ABC   salary    Description wage_element_1
2 ABC   salary    Code        51B           
3 ABC   salary    Base        600           
4 ABC   salary    Description wage_element_2
5 ABC   salary    Code        51C           
6 ABC   salary    Base        740           
7 ABC   salary    Description wage_element_3
8 ABC   salary    Code        51D      

with roughly 1000 different id, and each having three possible values for wage_type. I want to change the values in the name column into columns. I have tried to use pivot but I am struggling to handle the resulting list-cols: since not all salary have a Base, the resulting list-cols are of different size as below:

example_df <- example_df %>%
  pivot_wider(id_cols = c(id, wage_type),
              names_from = name,
              values_from = value)

example_df

# A tibble: 1 x 5
  id    wage_type Description Code      Base     
  <chr> <chr>     <list>      <list>    <list>   
1 ABC   salary    <chr [3]>   <chr [3]> <chr [2]>

So when I try to unnest the cols it throws an error:

example_df%>%
  unnest(cols = c(Description,Code,Base))

Error: Can't recycle `Description` (size 3) to match `Base` (size 2).

I understand that is because tidyr functions do not recycle, but I could not find a way around this or a base rsolution to my problem. I have tried to make a df with unlist(strsplit(as.character(x)) solution as per how to split one row into multiple rows in R but also ran into a resulting column length issue.

Desired output is as follows:

desired_df <- 
  tibble(
    id=c("ABC","ABC","ABC"),
    wage_type=c("salary","salary","salary"),
    Description = c("wage_element_1","wage_element_2","wage_element_3"),
    Code = c("51B","51C","51D"),
    Base = c("600","740",NA))

desired_df

id    wage_type Description    Code  Base 
  <chr> <chr>     <chr>          <chr> <chr>
1 ABC   salary    wage_element_1 51B   600  
2 ABC   salary    wage_element_2 51C   740  
3 ABC   salary    wage_element_3 51D   NA  

I would love a tidyr solution but any help would be appreciated. Thanks.

flafont11
  • 137
  • 7

1 Answers1

1

I would suggest this approach using tidyverse functions. The issue you had was due to how functions manage different rows. So, by creating an id variable like id2 you can avoid list outputs in your final reshaped data:

library(tidyverse)
#Code
example_df %>% 
  arrange(name) %>%
  group_by(id,wage_type,name) %>%
  mutate(id2=1:n()) %>% ungroup() %>%
  pivot_wider(names_from = name,values_from=value) %>%
  select(-id2)

Output:

# A tibble: 3 x 5
  id    wage_type Base  Code  Description   
  <chr> <chr>     <chr> <chr> <chr>         
1 ABC   salary    600   51B   wage_element_1
2 ABC   salary    740   51C   wage_element_2
3 ABC   salary    NA    51D   wage_element_3
Duck
  • 39,058
  • 13
  • 42
  • 84