Merge strings variables across data tables separated by a comma and remove duplicates in R

Question

I have multiple dataframes that look like this:

df.1 <- data.frame(name = c("A", "B", "C"),
                   var1 = c("cat,dog", "dog,horse,bird", "eagle,fox,chick"))

df.2 <- data.frame(name = c("A", "B", "C"),
                   var2 = c("cat,dog,worm", "dog,horse,bird", "giraffe"))

df.3 <- data.frame(name = c("A", "B", "C"),
                   var3 = c("cat,dog,worm", "dog,horse,bird,cat,lion", NA))
## merged 
all <- df.1 %>% 
  left_join(df.2) %>% 
  left_join(df.3)

> all
  name            var1           var2                    var3
1    A         cat,dog   cat,dog,worm            cat,dog,worm
2    B  dog,horse,bird dog,horse,bird dog,horse,bird,cat,lion
3    C eagle,fox,chick        giraffe                    <NA>

I would now like to merge vars1 to 3 to get a concise list without duplicates.

So for example the combined data frame should look like:

  name                    var1
1    A            cat,dog,worm
2    B dog,horse,bird,cat,lion
3    C eagle,fox,chick,giraffe

paste will allow for merging comma separated strings but I am not sure how to remove the duplicates?

I would like to do this using tidyverse and not another package.

This might help: https://stackoverflow.com/questions/35286596/how-to-remove-duplicate-comma-separated-character-values-from-each-cell-of-a-col Just paste together all the values from the columns first: https://stackoverflow.com/questions/14568662/paste-multiple-columns-together — MrFlick, Jul 04 '20 at 07:08
There's nothing special in dplyr for the splitting. That's purely a string operation. Your data isn't "tidy" so it's not great for dplyr verbs. The `stringr` package it more helpful for the splitting. Or you might want to use the `tidyr::separate_rows()` function to make your data more tidy. Work with the values not as a comma separated string, but as proper cell values. — MrFlick, Jul 04 '20 at 07:13

score 1 · Accepted Answer · answered Jul 04 '20 at 08:45

You can get the data in long format, get values in different rows, for each name create a unique non-NA comma-separated value.

library(dplyr)
library(tidyr)

all %>%
  pivot_longer(cols = var1:var3, names_to = 'col') %>%
  separate_rows(value) %>%
  group_by(name) %>%
  summarise(var1 = toString(na.omit(unique(value))))

# name  var1                       
#  <chr> <chr>                      
#1 A     cat, dog, worm             
#2 B     dog, horse, bird, cat, lion
#3 C     eagle, fox, chick, giraffe

In base R, we can use apply row-wise :

all$var4 <- apply(all[-1], 1, function(x) 
                  toString(na.omit(unique(unlist(strsplit(x, ','))))))

score 0 · Answer 2 · answered Jul 04 '20 at 08:16

Using baseR,

all$VAR <- sapply(seq(nrow(all)),function(x) {

            k <- paste(unlist(all[x,2:4][,!is.na(all[x,2:4])]),collapse=",")
            paste(unique(unlist(strsplit(k,","))),collapse=",") })

gives,

  name            var1           var2                    var3                     VAR
1    A         cat,dog   cat,dog,worm            cat,dog,worm            cat,dog,worm
2    B  dog,horse,bird dog,horse,bird dog,horse,bird,cat,lion dog,horse,bird,cat,lion
3    C eagle,fox,chick        giraffe                    <NA> eagle,fox,chick,giraffe

score 0 · Answer 3 · answered Jul 04 '20 at 08:30

This is a fully tidyverse solution, although somewhat lengthy in code. It can surely be optimized, but it answers your question.

For a better explanation of all the steps, I suggest you to run them one by one to understand the performed operations and how the dataset changes after each step.

library(tidyverse)

all %>% 
  pivot_longer(starts_with("var"), names_to = "vars") %>% 
  mutate(value = str_split(value, ",")) %>% 
  unnest(value) %>%
  filter(!is.na(value)) %>% 
  group_by(name) %>% 
  distinct(value) %>% 
  mutate(var = paste(value, collapse = ",")) %>% 
  select(-value) %>% 
  ungroup() %>% 
  distinct()

# A tibble: 3 x 2
#   name  var                    
#   <chr> <chr>                  
# 1 A     cat,dog,worm           
# 2 B     dog,horse,bird,cat,lion
# 3 C     eagle,fox,chick,giraffe

Chris Ruehlemann · Answer 4 · 2020-07-04T08:59:46.993

Here's a base Rsolution:

First step: paste the rows together, strsplit them, reduce to unique values, combine toString, and remove NA:

var1 <- gsub(", NA", "", unlist(as.character(lapply(sapply(strsplit(apply(all[2:4], 1, paste0, collapse = " "), " |,"), unique), toString))))

Second step: column-bind all[1] with var1

new <- cbind(all[1], var1)

Result:

new
  name                        var1
1    A              cat, dog, worm
2    B dog, horse, bird, cat, lion
3    C  eagle, fox, chick, giraffe

Merge strings variables across data tables separated by a comma and remove duplicates in R

4 Answers4