Collapse duplicated rows with different values in different columns using R

Question

I have a data frame with 500 observations, but I am only showing 3 in my example. These are duplicates with different values in different columns (except ID column, which includes which person is duplicated). I am replicating how the data frame is (df) and how it should look like after treatment (df_new). Is this possible ? The data frame is 10 variables, so I am not worried about 'doubling' them. The values in the variables are a,b,c,d,0,''. I however kept them more general in the tables.

df <- data.frame(ID =  c('1','1','2', '2', '3','3'),
                 Year = c('smaller year.1', 'bigger year.1', 'bigger year.2', 'smaller year.2', 'same year.3', 'same year.3'),
                 V1 = c('a', 'b','c','d','e','f'),
                 V2 = c('g', 'h', 'i', 'j', 'k', 'l'),
                 Vn = c('n1', 'n2','n3','n4','n5','n6'))


df_new <- data.frame(ID = c('1','2','3'),
                     Year_smaller = c('smaller year.1', 'smaller year.2', 'same year.3'),
                     Year_bigger = c('bigger year.1', 'bigger year.2', 'same year.3'),
                     V1 = c('a','c','e'),
                     V1.1 = c('b','d','f'),
                     V2 = c('g','i','k'),
                     V2.1 = c('h','j','l'),
                     Vn = c('n1','n3','n5'),
                     Vn.1 = c('n2','n4','n6'))

AnilGoyal · Accepted Answer · 2021-06-19T16:26:56.420

For edited data and as per revised requirements. Since in alphabet b comes before s therefor bigger_year is shown before smaller_year however, in the real data you'll have correctly sorted years. Still if you want to sort strings like that use sort(desc(Year)) instead of sort(Year)

df <- data.frame(ID =  c('1','1','2', '2', '3','3'),
                 Year = c('smaller year.1', 'bigger year.1', 'bigger year.2', 'smaller year.2', 'same year.3', 'same year.3'),
                 V1 = c('a', 'b','c','d','e','f'),
                 V2 = c('g', 'h', 'i', 'j', 'k', 'l'),
                 Vn = c('n1', 'n2','n3','n4','n5','n6'))


library(tidyverse)

df %>% group_by(ID) %>% mutate(Year = sort(Year)) %>% 
  mutate(rid = row_number()) %>%
  pivot_wider(id_cols = ID, names_from = rid, values_from = c(Year:Vn), names_sep = '')

#> # A tibble: 3 x 9
#> # Groups:   ID [3]
#>   ID    Year1         Year2          V11   V12   V21   V22   Vn1   Vn2  
#>   <chr> <chr>         <chr>          <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1     bigger year.1 smaller year.1 a     b     g     h     n1    n2   
#> 2 2     bigger year.2 smaller year.2 c     d     i     j     n3    n4   
#> 3 3     same year.3   same year.3    e     f     k     l     n5    n6

^{Created on 2021-06-19 by the reprex package (v2.0.0)}

library(tidyverse)

df %>% group_by(ID) %>% mutate(rid = row_number()) %>%
  pivot_wider(id_cols = ID, names_from = rid, values_from = c(Year:Variable_n), names_sep = '')

# A tibble: 3 x 9
# Groups:   ID [3]
  ID    Year1          Year2          Variable_a1 Variable_a2 Variable_b1 Variable_b2 Variable_n1 Variable_n2
  <chr> <chr>          <chr>          <chr>       <chr>       <chr>       <chr>       <chr>       <chr>      
1 1     smaller year.1 bigger year.1  va11        va12        vb11        vb12        vn11        vn12       
2 2     bigger year.2  smaller year.2 va21        va22        vb21        vb22        vn21        vn22       
3 3     same year.3    same year.3    va31        va32        vb31        vb32        vn31        vn32

Do u mean this?


df %>% group_by(ID) %>% arrange(desc(Year)) %>% mutate(rid = row_number()) %>%
  pivot_wider(id_cols = ID, names_from = rid, values_from = c(Year:Variable_n), names_sep = '')

# A tibble: 3 x 9
# Groups:   ID [3]
  ID    Year1          Year2         Variable_a1 Variable_a2 Variable_b1 Variable_b2 Variable_n1 Variable_n2
  <chr> <chr>          <chr>         <chr>       <chr>       <chr>       <chr>       <chr>       <chr>      
1 2     smaller year.2 bigger year.2 va22        va21        vb22        vb21        vn22        vn21       
2 1     smaller year.1 bigger year.1 va11        va12        vb11        vb12        vn11        vn12       
3 3     same year.3    same year.3   va31        va32        vb31        vb32        vn31        vn32

Would it be possible to include the smaller years in the year_smaller column and the bigger ones in the year_bigger? Should I arrange the data according to ID and year beforehand? — Youknowme, Jun 19 '21 at 12:15
The order of vx2y should be inversed, as in : smaller year.2 , bigger year.2, va21,va22,vb21,vb22,vn22. The order matters since they are coupled with the years. The rest of your outcome for ID 1 and 3 is perfect !! It is strange that only ID 2 doesn't work. — Youknowme, Jun 19 '21 at 12:47
Hello @AnilGoyal, in your outcome at variable_a1 x ID2 the answer should be va21 and not va22. The second index in the outcome should all be the same for a given column, whereas the first index represents ID. — Youknowme, Jun 19 '21 at 13:05
Hello @AnilGoyal :) I tried your solution on a different data but it didn't work. I edited my dataframe with the desired output. (hopefully it is clearer) — Youknowme, Jun 19 '21 at 14:20

score 2 · Answer 2 · answered Jun 19 '21 at 09:06

Here is one data.table option -

library(data.table)

cols <- grep('Variable', names(df), value = TRUE)
dcast(setDT(df), ID~rowid(ID), value.var = c('Year', cols))


#   ID         Year_1         Year_2 Variable_a_1 Variable_a_2 Variable_b_1
#1:  1 smaller year.1  bigger year.1         va11         va12         vb11
#2:  2  bigger year.2 smaller year.2         va21         va22         vb21
#3:  3    same year.3    same year.3         va31         va32         vb31

#   Variable_b_2 Variable_n_1 Variable_n_2
#1:         vb12         vn11         vn12
#2:         vb22         vn21         vn22
#3:         vb32         vn31         vn32

Anoushiravan R · Answer 3 · 2021-06-19T13:28:42.970

2

We can also use the following solution:

library(dplyr)
library(tidyr)
library(purrr)

df %>%
  group_split(ID) %>%
  map_dfr(~ .x %>% 
            mutate(id = row_number()) %>%
            pivot_wider(names_from = id, values_from = c(Year, Variable_a, Variable_b, Variable_n),
                        names_sep = "") %>%
            rename(Year_smaller = Year1,
                   Year_bigger = Year2)) %>%
  select(starts_with("Year"))

# A tibble: 3 x 9
  ID    Year_smaller   Year_bigger    Variable_a1 Variable_a2 Variable_b1 Variable_b2 Variable_n1
  <chr> <chr>          <chr>          <chr>       <chr>       <chr>       <chr>       <chr>      
1 1     smaller year.1 bigger year.1  va11        va12        vb11        vb12        vn11       
2 2     bigger year.2  smaller year.2 va21        va22        vb21        vb22        vn21       
3 3     same year.3    same year.3    va31        va32        vb31        vb32        vn31       
# ... with 1 more variable: Variable_n2 <chr>

edited Jun 19 '21 at 13:28

answered Jun 19 '21 at 10:09

Anoushiravan R

21,622
3
18
41

1

Would it be possible to include the smaller years in the year_smaller column and the bigger ones in the year_bigger? Should I arrange the data according to ID and year beforehand? – Youknowme Jun 19 '21 at 12:14
@GabraNohmie Yes I made a slight modification, I hope this is what you are looking for. – Anoushiravan R Jun 19 '21 at 12:40
1

The order of vx2y should be inversed, as in : smaller year.2 , bigger year.2, va21,va22,vb21,vb22,vn22. The order matters since they are coupled with the years. The rest of your outcome for ID 1 and 3 is perfect !! It is strange that only ID 2 doesn't work. Same thing happened with @AnilGoyal – Youknowme Jun 19 '21 at 12:50
@GabraNohmie Yes I got that, maybe we could figure out a way to exchange those smaller and bigger between the `Year_smaller` and `Year_bigger`. Would it work for you? I mean was the configuration of my first answer was something you were looking for? Yeah Anil is a close friend of mine we have a very similar coding style lol. – Anoushiravan R Jun 19 '21 at 13:04
Oh ! Well both of your answers are great, including the configuration. Only problem is the outcome of ID 2 which should follow ID 1 and 3. A data.table answer was provided, but I am interested in seeing how this would work with dplyr and tidyr as well :) – Youknowme Jun 19 '21 at 13:15
1

But @GabraNohmie isn't the output with `data.table` same with these? – AnilGoyal Jun 19 '21 at 13:43

score 1 · Answer 4 · answered Jun 19 '21 at 04:52

Here is my inelegant answer. This might give you some idea.

library(tidyverse)

year_df <- df %>% 
  select(ID, Year) %>% 
  filter(!str_detect(Year, "^same")) %>% 
  mutate(year_group = map_chr(Year, ~str_split(., " ")[[1]][1])) %>% 
  pivot_wider(
    names_from = year_group,
    values_from = Year
  ) %>% 
  add_row(ID = "3", smaller = "same year.3", bigger = "same year.3")

df_new <- df %>% 
  select(-Year) %>%
  pivot_longer(-ID) %>% 
  mutate(
    group = paste0(str_sub(name, -1), str_sub(value, -1)),
    name = str_remove(name, "_[a-z]")
  ) %>%
  pivot_wider(
    names_from = c(name, group),
    values_from = value
  ) %>% 
  left_join(year_df, by = "ID") %>% 
  relocate(c(smaller, bigger), .after = ID)

df_new

Collapse duplicated rows with different values in different columns using R

4 Answers4

Linked