I have a dataframe that looks like this:

+--+---------------------------+
|id|grids                      |
+--+---------------------------+
|c1|21257a|75589y|21257a|75589y|
|c2|21257a|21257a|21257a|21257a|
|c3|21257a|75589y|75589y|33421v|

However, the grids column contains duplicate values within each pipe-separated string. I'd like to keep only the distinct values in each row, so that the dataframe looks like this:

+--+---------------------------+
|id|grids                      |
+--+---------------------------+
|c1|21257a|75589y              |
|c2|21257a                     |
|c3|21257a|75589y|33421v       |

Any help would be appreciated!

code_learner93

3 Answers

Using sapply, split each string on |, keep only the unique values in each row, and paste them back together.

df$grids <- sapply(strsplit(df$grids, '|', fixed = TRUE), function(x) 
                   paste0(unique(x), collapse = '|'))
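
To make the three steps concrete, here is a minimal sketch that runs them one at a time on a single string taken from the example data. The variable names x, parts, uniq and result are only for illustration and are not part of the answer above.

```r
# One grid string taken from the example data
x <- "21257a|75589y|75589y|33421v"

# 1. Split on the literal pipe; fixed = TRUE stops "|" being read as regex alternation
parts <- strsplit(x, "|", fixed = TRUE)[[1]]

# 2. Drop repeats, keeping the first occurrence of each value in its original order
uniq <- unique(parts)

# 3. Glue the remaining values back together with the same separator
result <- paste(uniq, collapse = "|")
print(result)
```

If grids is stored as a factor (the default for data.frame() before R 4.0.0), strsplit() raises an error, so wrapping the column in as.character() first is a reasonable precaution.
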
Ronak Shah

Here is a base R, regex-based approach:

df$grids <- gsub("\\b(.+?)(?=\\|.*\\1)", "", df$grids, perl=TRUE)
df$grids <- gsub("^\\|+|\\|+$", "", df$grids)
df$grids <- gsub("\\|{2,}", "|", df$grids)
df

  id                grids
1 c1        21257a|75589y
2 c2               21257a
3 c3 21257a|75589y|33421v

Data:

df <- data.frame(id=c("c1", "c2", "c3"),
                 grids=c("21257a|75589y|21257a|75589y",
                         "21257a|21257a|21257a|21257a",
                         "21257a|75589y|75589y|33421v"))

The regex \b(.+?)(?=\|.*\1) matches any pipe-separated term that appears again later in the same grid string, and the first gsub call replaces each such term with an empty string. The two cleanup steps then remove any pipes left dangling at the beginning or end of the string and collapse runs of multiple pipes in the middle into a single pipe.
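
The role of each substitution is easier to see when the three gsub calls are applied one at a time to a single string. The following is a minimal sketch, with x, step1, step2 and step3 as illustrative names only:

```r
# One grid string taken from the example data
x <- "21257a|75589y|75589y|33421v"

# Remove any term that appears again later in the string
# (perl = TRUE is needed for the lookahead)
step1 <- gsub("\\b(.+?)(?=\\|.*\\1)", "", x, perl = TRUE)

# Strip pipes left dangling at the start or end of the string
step2 <- gsub("^\\|+|\\|+$", "", step1)

# Collapse runs of pipes left in the middle into a single pipe
step3 <- gsub("\\|{2,}", "|", step2)

# Show the string after each step
print(c(step1, step2, step3))
```

One thing to watch: the backreference \1 is not anchored to pipe boundaries, so a term that also occurs inside a later, longer term (for example 123 followed by 1234) would be removed as well. The approach therefore assumes that no term is a substring of another.
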

Tim Biegeleisen

Using the data from @Tim's answer:

library(tidyverse)
df <- data.frame(id=c("c1", "c2", "c3"),
                 grids=c("21257a|75589y|21257a|75589y",
                         "21257a|21257a|21257a|21257a",
                         "21257a|75589y|75589y|33421v"))
df %>% mutate(grids = map_chr(str_split(grids, '\\|'), 
                              ~paste(unique(.x), collapse = '|')))
#>   id                  grids
#> 1 c1          21257a|75589y
#> 2 c2                 21257a
#> 3 c3   21257a|75589y|33421v

Created on 2021-05-27 by the reprex package (v2.0.0)

AnilGoyal