1

I have a dataset with 12000+ records, that looks like below which I need to count the strings. the dataset looks like

Drugs                                      Gender   year
met,met,sulp,DPP                            M       2020
met and sulp and DPP                        M       2021
SGLT SGLT SGLT                              M       2018
Incretin, AGI, AGI                          F       2019
THK, USP                                    F       2013

I need output like this kindly suggest me

Drugs                      number of drugs  Gender  year
met,met,sulp,DPP                3             M     2020
met and sulp and DPP            3             M     2021
SGLT SGLT SGLT                  1             M     2018
Incretin, AGI, AGI              2             F     2019
THK, USP                        2             F     2013

Thanks in advance

kinsgter24
  • 39
  • 5
  • how is the 1st value 5? Do you also want to count duplicate values separately ? Try `stringr::str_count(df$Drugs, regex('DRUG', ignore_case = TRUE))` – Ronak Shah Jul 29 '21 at 04:44

3 Answers3

3

You can use stringr::str_count to count number of 'DRUG' values.

library(stringr)

df$num_drugs <- str_count(df$Drugs, regex('DRUG', ignore_case = TRUE))

To count the unique values you can use -

df$num_drugs <- sapply(strsplit(df$Drugs, ',\\s*'), function(x) length(unique(x)))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
3

Update after changing the input: Thanks to Rui Barradas for his support!

First we make a vector with the elements to count. this could be done maybe more elegant.

After that use regex to count:

library(tidyr)
library(dplyr)

df1 <- df %>% 
    select(Drugs) %>% 
    separate_rows(Drugs, sep = ",") %>% 
    separate_rows(Drugs, sep = " and ") %>% 
    separate_rows(Drugs, sep = " ") %>% 
    mutate(Drugs = str_trim(Drugs)) %>% 
    distinct(Drugs) %>% 
    filter(Drugs != "")

my_expression <- paste(df1$Drugs, collapse="|")

df %>% 
    mutate(number = lengths(gregexpr(my_expression, Drugs)), .before=2)

Output:

  Drugs                number Gender year 
  <chr>                 <int> <chr>  <chr>
1 met,met,sulp,DPP          4 M      2020 
2 met and sulp and DPP      3 M      2021 
3 SGLT SGLT SGLT            3 M      2018 
4 Incretin, AGI, AGI        3 F      2019 
5 THK, USP                  2 F      2013 
TarJae
  • 72,363
  • 6
  • 19
  • 66
1

Assuming that you have a more unclean data and can have leading white spaces, I propose this approach

library(tidyverse)
df <- read.table(header = TRUE, text = "Drugs                                      Gender   year
'met,met,sulp,DPP '                           M       2020
'met and sulp and DPP '                       M       2021
'SGLT SGLT SGLT  '                            M       2018
'Incretin, AGI, AGI '                         F       2019
'THK, USP'                                    F       2013")

df %>%
  mutate(number_of_drugs = map(str_split(gsub('\\sand\\s|\\W+', ' ', str_trim(Drugs)), '\\s+'), ~ length(unique(.x))))
#>                   Drugs Gender year number_of_drugs
#> 1     met,met,sulp,DPP       M 2020               3
#> 2 met and sulp and DPP       M 2021               3
#> 3      SGLT SGLT SGLT        M 2018               1
#> 4   Incretin, AGI, AGI       F 2019               2
#> 5              THK, USP      F 2013               2

Created on 2021-07-29 by the reprex package (v2.0.0)

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45