How can I obtain the distinct values for a "|" delimited column?

Question

I have a dataframe that looks like this:

+--+---------------------------+
|id|grids                      |
+--+---------------------------+
|c1|21257a|75589y|21257a|75589y|
|c2|21257a|21257a|21257a|21257a|
|c3|21257a|75589y|75589y|33421v|

However, since there are duplicate values in the grids column, I'd like to keep only the distinct values in each row, so that the dataframe looks like this:

+--+---------------------------+
|id|grids                      |
+--+---------------------------+
|c1|21257a|75589y              |
|c2|21257a                     |
|c3|21257a|75589y|33421v       |

Any help would be appreciated!

score 1 · Answer 1 · answered May 27 '21 at 06:49

Using sapply, split each string on |, keep only the unique values in each row, and paste them back together.

df$grids <- sapply(strsplit(df$grids, '|', fixed = TRUE), function(x) 
                   paste0(unique(x), collapse = '|'))
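
As a quick check, here is a self-contained sketch of the same split / unique / paste idea run on the sample data. The data frame construction is an assumption about how the data is stored. stringsAsFactors = FALSE is set explicitly because strsplit needs character input, and R versions before 4.0 create factors by default. vapply is swapped in for sapply only to guarantee a character result; it is not part of the original answer.

```r
# Sample data, rebuilt here so the snippet runs on its own (assumed layout).
df <- data.frame(
  id = c("c1", "c2", "c3"),
  grids = c("21257a|75589y|21257a|75589y",
            "21257a|21257a|21257a|21257a",
            "21257a|75589y|75589y|33421v"),
  stringsAsFactors = FALSE  # strsplit() fails on factors
)

# Split each row on a literal "|" (fixed = TRUE avoids regex escaping).
parts <- strsplit(df$grids, "|", fixed = TRUE)

# Keep the first occurrence of each value, then glue the pieces back together.
df$grids <- vapply(parts,
                   function(x) paste(unique(x), collapse = "|"),
                   character(1))

print(df)
```

unique keeps the first occurrence of each value, so the original left-to-right order is preserved.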

Ronak Shah

Tim Biegeleisen · Answer 2 · 2021-05-27T07:10:44.143

Here is a base R, regex-based approach:

df$grids <- gsub("\\b(.+?)(?=\\|.*\\1)", "", df$grids, perl=TRUE)
df$grids <- gsub("^\\|+|\\|+$", "", df$grids)
df$grids <- gsub("\\|{2,}", "|", df$grids)
df

  id                grids
1 c1        21257a|75589y
2 c2               21257a
3 c3 21257a|75589y|33421v

Data:

df <- data.frame(id=c("c1", "c2", "c3"),
                 grids=c("21257a|75589y|21257a|75589y",
                         "21257a|21257a|21257a|21257a",
                         "21257a|75589y|75589y|33421v"))

The regex \b(.+?)(?=\|.*\1) matches any pipe-separated term that appears again later in the same grid string. Each such match is replaced with an empty string, so only the last occurrence of every term survives. The two follow-up gsub calls are cleanup steps: the first removes pipes left dangling at the beginning or end of the string, and the second collapses runs of two or more pipes into a single pipe.
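
To see how the regex approach behaves next to the split / unique / paste approach, here is a small sketch on two made-up inputs (they are not from the question). The helper names dedupe_regex and dedupe_split are hypothetical and only wrap the code already shown in the answers. It illustrates two things worth knowing: the regex keeps the last occurrence of a term rather than the first, and its backreference can match inside a longer term.

```r
# Hypothetical helper: the three gsub() steps from the regex answer.
dedupe_regex <- function(x) {
  x <- gsub("\\b(.+?)(?=\\|.*\\1)", "", x, perl = TRUE)  # drop terms repeated later
  x <- gsub("^\\|+|\\|+$", "", x)                        # trim leading/trailing pipes
  gsub("\\|{2,}", "|", x)                                # collapse repeated pipes
}

# Hypothetical helper: split on a literal "|", keep unique values, re-join.
dedupe_split <- function(x) {
  vapply(strsplit(x, "|", fixed = TRUE),
         function(p) paste(unique(p), collapse = "|"),
         character(1))
}

# Made-up inputs chosen to expose the differences.
x <- c("b|a|b", "12|312")

regex_result <- dedupe_regex(x)  # "a|b"  "312"
split_result <- dedupe_split(x)  # "b|a"  "12|312"

print(data.frame(input = x, regex = regex_result, split = split_result))
```

For the data in the question both approaches give the same result. If terms can be substrings of one another, or if the order of first appearance matters, the split-based version is probably the safer choice.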

AnilGoyal · Answer 3 · 2021-05-27T07:25:53.017

Using the data from @Tim's answer:

library(tidyverse)
df <- data.frame(id=c("c1", "c2", "c3"),
                 grids=c("21257a|75589y|21257a|75589y",
                         "21257a|21257a|21257a|21257a",
                         "21257a|75589y|75589y|33421v"))
df %>% mutate(grids = map_chr(str_split(grids, '\\|'), 
                              ~paste(unique(.x), collapse = '|')))
#>   id                  grids
#> 1 c1          21257a|75589y
#> 2 c2                 21257a
#> 3 c3   21257a|75589y|33421v

Created on 2021-05-27 by the reprex package (v2.0.0)
