How to count unique strings and enter the count in another column in R

Question

I have a dataset with 12,000+ records in which I need to count the strings. The dataset looks like this:

Drugs                                      Gender   year
met,met,sulp,DPP                            M       2020
met and sulp and DPP                        M       2021
SGLT SGLT SGLT                              M       2018
Incretin, AGI, AGI                          F       2019
THK, USP                                    F       2013

I need output like this; kindly suggest how to do it:

Drugs                      number of drugs  Gender  year
met,met,sulp,DPP                3             M     2020
met and sulp and DPP            3             M     2021
SGLT SGLT SGLT                  1             M     2018
Incretin, AGI, AGI              2             F     2019
THK, USP                        2             F     2013

Thanks in advance

How is the 1st value 5? Do you also want to count duplicate values separately? Try `stringr::str_count(df$Drugs, regex('DRUG', ignore_case = TRUE))` – Ronak Shah, Jul 29 '21 at 04:44

Ronak Shah · Accepted Answer · 2021-07-29T05:00:37.037

3

You can use `stringr::str_count` to count the number of 'DRUG' values.

library(stringr)

df$num_drugs <- str_count(df$Drugs, regex('DRUG', ignore_case = TRUE))

To count the unique values, you can use:

df$num_drugs <- sapply(strsplit(df$Drugs, ',\\s*'), function(x) length(unique(x)))
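The split pattern above only handles commas. For the updated input, where drugs are also separated by " and " or by plain spaces, a minimal base R sketch could look like the following. It assumes those three separators are the only ones present and that no drug name itself contains a space; `count_unique_drugs` is a hypothetical helper name, and the data frame is rebuilt from the question so the block runs on its own.

```r
# Hypothetical helper: count the distinct drug names in each row.
# Assumption: separators are commas, the word "and", or whitespace.
count_unique_drugs <- function(x) {
  # Split on a comma (with optional spaces), on " and ", or on whitespace
  parts <- strsplit(trimws(x), "\\s*,\\s*|\\s+and\\s+|\\s+")
  # Drop empty pieces, then count the distinct values per row
  vapply(parts, function(p) length(unique(p[nzchar(p)])), integer(1))
}

# Sample data copied from the question
df <- data.frame(
  Drugs = c("met,met,sulp,DPP", "met and sulp and DPP", "SGLT SGLT SGLT",
            "Incretin, AGI, AGI", "THK, USP"),
  Gender = c("M", "M", "M", "F", "F"),
  year = c(2020, 2021, 2018, 2019, 2013)
)

df$num_drugs <- count_unique_drugs(df$Drugs)
print(df)
```

On the five sample rows this gives 3, 3, 1, 2, 2, matching the output requested in the question. Real data may need extra separators (for example "+" or ";") added to the pattern.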

edited Jul 29 '21 at 05:00

answered Jul 29 '21 at 04:51

Ronak Shah

I had upvoted but now the OP has radically changed the input... – Rui Barradas Jul 29 '21 at 04:56
Thanks, I have updated the answer. Hope that works for OP's new input. – Ronak Shah Jul 29 '21 at 05:01
@ronak shah I have updated the changes – kinsgter24 Jul 29 '21 at 05:04
@kinsgter24 Try the answer using `strsplit` and `sapply`. – Ronak Shah Jul 29 '21 at 05:07

TarJae · Answer 2 · 2021-07-29T05:57:35.370

3

Update after changing the input: Thanks to Rui Barradas for his support!

First we build a vector of the distinct elements to count; this could probably be done more elegantly.

Then use a regex to count the matches (every occurrence is counted, so repeated drugs are included):

library(tidyr)
library(dplyr)
library(stringr)  # provides str_trim()

df1 <- df %>% 
    select(Drugs) %>% 
    separate_rows(Drugs, sep = ",") %>% 
    separate_rows(Drugs, sep = " and ") %>% 
    separate_rows(Drugs, sep = " ") %>% 
    mutate(Drugs = str_trim(Drugs)) %>% 
    distinct(Drugs) %>% 
    filter(Drugs != "")

my_expression <- paste(df1$Drugs, collapse="|")

df %>% 
    mutate(number = lengths(gregexpr(my_expression, Drugs)), .before=2)

Output:

  Drugs                number Gender year 
  <chr>                 <int> <chr>  <chr>
1 met,met,sulp,DPP          4 M      2020 
2 met and sulp and DPP      3 M      2021 
3 SGLT SGLT SGLT            3 M      2018 
4 Incretin, AGI, AGI        3 F      2019 
5 THK, USP                  2 F      2013
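The counts above include repeats (for example 4 for `met,met,sulp,DPP`), whereas the question asks for distinct drugs. One possible adaptation of the same idea, sketched here rather than taken from the answer, is to extract the matches with `regmatches` and count the unique ones. The pattern is written out by hand as an assumption so that the block is self-contained; in the pipeline above it would be `my_expression`.

```r
# Sketch: count distinct matches per row instead of all matches.
# Assumption: the pattern is an alternation of the known drug names,
# typed out here so the block runs without the earlier steps.
drugs <- c("met,met,sulp,DPP", "met and sulp and DPP", "SGLT SGLT SGLT",
           "Incretin, AGI, AGI", "THK, USP")
my_expression <- "met|sulp|DPP|SGLT|Incretin|AGI|THK|USP"

# regmatches() returns the matched substrings for each row
matches <- regmatches(drugs, gregexpr(my_expression, drugs))

# Keep only the distinct matches and count them
n_unique <- lengths(lapply(matches, unique))
print(n_unique)
```

A side benefit is that a row with no match yields 0 here, whereas `lengths(gregexpr(...))` reports 1 for such a row because `gregexpr` returns a single `-1`. Because the pattern matches substrings, a drug name contained inside another name would still need word boundaries or sorting to be handled correctly.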

edited Jul 29 '21 at 05:57

answered Jul 29 '21 at 04:56

TarJae

You may want to check out the *new* input. – Rui Barradas Jul 29 '21 at 04:57
Also, you don't need `regmatches`, the return value of `gregexpr` is a vector of positions, with several attributes set, so `lengths(gregexpr(.))` will count the number of matches. – Rui Barradas Jul 29 '21 at 05:01
Please see my update. – TarJae Jul 29 '21 at 06:00

score 1 · Answer 3 · answered Jul 29 '21 at 05:54

Assuming your data is messier and may contain leading or trailing whitespace, I propose this approach:

library(tidyverse)
df <- read.table(header = TRUE, text = "Drugs                                      Gender   year
'met,met,sulp,DPP '                           M       2020
'met and sulp and DPP '                       M       2021
'SGLT SGLT SGLT  '                            M       2018
'Incretin, AGI, AGI '                         F       2019
'THK, USP'                                    F       2013")

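df %>%
  # treat " and " and any run of non-word characters as a separator, split on
  # whitespace, then count the distinct drugs; map_int() returns an integer column
  mutate(number_of_drugs = map_int(
    str_split(gsub('\\sand\\s|\\W+', ' ', str_trim(Drugs)), '\\s+'),
    ~ length(unique(.x))
  ))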
#>                   Drugs Gender year number_of_drugs
#> 1     met,met,sulp,DPP       M 2020               3
#> 2 met and sulp and DPP       M 2021               3
#> 3      SGLT SGLT SGLT        M 2018               1
#> 4   Incretin, AGI, AGI       F 2019               2
#> 5              THK, USP      F 2013               2

Created on 2021-07-29 by the reprex package (v2.0.0)
