1

i have a dataframe with two columns:

 VAR1.                     VAR2.

  A.              102 million; 102 million

  B.      0.1 million; 2 million; 0.1 million; 2 million

I want to remove duplicate values of VAR2. for each row, obtaining

 VAR1.           VAR2.

  A.          102 million

  B.     0.1 million; 2 million

How can I do?

thank you for your suggestions.

Silvia
  • 405
  • 4
  • 17

2 Answers2

2

Using base R, we can split the string on ";" and paste unique entries for VAR2

sapply(strsplit(df$VAR2, ";"), function(x) paste(unique(x), collapse = ";"))
#[1] "102 million"           "0.1 million;2 million"

Using dplyr and tidyr we can use separate_rows to bring VAR2 into different rows and then paste only unique entries per group.

library(dplyr)
library(tidyr)

df %>%
 separate_rows(VAR2, sep = ";") %>%
 group_by(VAR1) %>%
 summarise(VAR2 = paste(unique(VAR2), collapse = ";"))

#  VAR1  VAR2                 
#  <fct> <chr>                
#1 A     102 million          
#2 B     0.1 million;2 million
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Here is a solution using sub which seems to work:

x <- "0.1 million; 2 million; 0.1 million; 2 million"
gsub("\\b(\\d+(?:\\.\\d+)?) ([^;]+); (?=.*\\b\\1 \\2\\b)", "", x, perl=TRUE)

[1] "0.1 million; 2 million"

The general strategy used here is to match a number, with an optional decimal component, followed by another word, provided that this number-word term appears at least once downstream in the input string. If it does appear again, then we remove the first terms by replacing with empty string. Note that the last occurrence of a pair of terms would not be deleted, owing to that the positive lookahead would fail.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360