Remove duplicate values on each string in R

Question

i have a dataframe with two columns:

 VAR1.                     VAR2.

  A.              102 million; 102 million

  B.      0.1 million; 2 million; 0.1 million; 2 million

I want to remove duplicate values of VAR2. for each row, obtaining

 VAR1.           VAR2.

  A.          102 million

  B.     0.1 million; 2 million

How can I do?

thank you for your suggestions.

score 2 · Answer 1 · answered May 27 '19 at 11:12

Using base R, we can split the string on ";" and paste unique entries for VAR2

sapply(strsplit(df$VAR2, ";"), function(x) paste(unique(x), collapse = ";"))
#[1] "102 million"           "0.1 million;2 million"

Using dplyr and tidyr we can use separate_rows to bring VAR2 into different rows and then paste only unique entries per group.

library(dplyr)
library(tidyr)

df %>%
 separate_rows(VAR2, sep = ";") %>%
 group_by(VAR1) %>%
 summarise(VAR2 = paste(unique(VAR2), collapse = ";"))

#  VAR1  VAR2                 
#  <fct> <chr>                
#1 A     102 million          
#2 B     0.1 million;2 million

Tim Biegeleisen · Answer 2 · 2019-05-27T11:07:38.850

1

Here is a solution using sub which seems to work:

x <- "0.1 million; 2 million; 0.1 million; 2 million"
gsub("\\b(\\d+(?:\\.\\d+)?) ([^;]+); (?=.*\\b\\1 \\2\\b)", "", x, perl=TRUE)

[1] "0.1 million; 2 million"

The general strategy used here is to match a number, with an optional decimal component, followed by another word, provided that this number-word term appears at least once downstream in the input string. If it does appear again, then we remove the first terms by replacing with empty string. Note that the last occurrence of a pair of terms would not be deleted, owing to that the positive lookahead would fail.

edited May 27 '19 at 11:07

answered May 27 '19 at 11:02

Tim Biegeleisen

502,043
27
286
360

Is there a way to generalize the above to include any repeat pattern, not just that follow the above pattern? – Cyrus Mohammadian Jan 30 '23 at 18:23

Remove duplicate values on each string in R

2 Answers2

Linked

Related