I need some help extracting specific code numbers from a character string in R. For example, I have the following data:

AMOXICIL/CLAVULAN 875/125 MG CM/CM REC (100000126)
HIDROCLOROTIAZIDA CM (50 MG) CONTENIDO (100028929)
ZIDOVUDINA 10 MG/ML O 50 MG/5 ML SOL ORAL O JARABE (500001802)

I need the code numbers (ALWAYS 9 digits) that appear at the end of each character string. Finally, I want to create a new column in my data frame with them:

                                                             1         2
            AMOXICIL/CLAVULAN 875/125 MG CM/CM REC (100000126) 100000126
            HIDROCLOROTIAZIDA CM (50 MG) CONTENIDO (100028929) 100028929
ZIDOVUDINA 10 MG/ML O 50 MG/5 ML SOL ORAL O JARABE (500001802) 500001802

I appreciate any help.

3 Answers

You can use `sub` to extract the 9-digit number at the end of the string.

sub('.*\\((\\d{9})\\)$', '\\1', df$V1)
#[1] "100000126" "100028929" "500001802"

You can wrap the call in `as.numeric` to convert the resulting string into a number.
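As a minimal sketch of that conversion, the snippet below rebuilds the sample data from the question and stores the result in a new column. The column name `code` is an assumption chosen for illustration; `V1` is the column name used in the answers.

```r
# Sample data mirroring the question
df <- data.frame(
  V1 = c("AMOXICIL/CLAVULAN 875/125 MG CM/CM REC (100000126)",
         "HIDROCLOROTIAZIDA CM (50 MG) CONTENIDO (100028929)",
         "ZIDOVUDINA 10 MG/ML O 50 MG/5 ML SOL ORAL O JARABE (500001802)"),
  stringsAsFactors = FALSE
)

# Extract the 9-digit code in parentheses at the end of each string
# and store it as a numeric column (the name "code" is hypothetical)
df$code <- as.numeric(sub('.*\\((\\d{9})\\)$', '\\1', df$V1))
```

Note that `sub` returns the input unchanged when the pattern does not match, so a string without a trailing `(nnnnnnnnn)` would turn into `NA` (with a warning) after `as.numeric`.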

Similarly, using `str_extract` from stringr:

stringr::str_extract(df$V1, '\\d{9}(?=\\))')
Ronak Shah
  • What if I have some strings like "AMOXICILIN 300MG(100005324)" (no space between the description and the code number) or "IBUPROFENO 100002345" (no "(" in the pattern)? The only thing I know for sure is that the code has 9 digits. – Diego Gonzalez Avalos Jul 13 '20 at 15:08
  • @DiegoGonzalezAvalos In that case, you can do `stringr::str_extract(df$V1, '\\d{9}')` to get the 9-digit number in the string. – Ronak Shah Jul 13 '20 at 23:41
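One caveat with a bare `\\d{9}` is that it would also match the first 9 digits of a longer run of digits. A possible base R sketch that only accepts runs of exactly 9 digits is shown below; the helper name `extract_code` and the last two test strings are assumptions added for illustration.

```r
x <- c("AMOXICIL/CLAVULAN 875/125 MG CM/CM REC (100000126)",
       "AMOXICILIN 300MG(100005324)",
       "IBUPROFENO 100002345",
       "TEN DIGITS 1234567890",   # hypothetical: too long, should not match
       "NO CODE HERE")            # hypothetical: no digits at all

# Hypothetical helper: return the first run of exactly 9 digits, or NA if none.
# The lookarounds require that the run is not preceded or followed by a digit.
extract_code <- function(s) {
  m <- regexpr("(?<!\\d)\\d{9}(?!\\d)", s, perl = TRUE)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(s))
  out[hit] <- regmatches(s, m)   # regmatches returns only the matched elements
  out
}

extract_code(x)
# "100000126" "100005324" "100002345" NA NA
```

Whether a 10-digit number should be rejected or truncated depends on the data, so the lookarounds can be dropped if the simpler behaviour is preferred.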

Maybe not the most elegant solution:

#Data
df <- structure(list(V1 = c("AMOXICIL/CLAVULAN 875/125 MG CM/CM REC (100000126)", 
"HIDROCLOROTIAZIDA CM (50 MG) CONTENIDO (100028929)", "ZIDOVUDINA 10 MG/ML O 50 MG/5 ML SOL ORAL O JARABE (500001802)"
)), row.names = c(NA, -3L), class = "data.frame")

#Code
df$index <- gsub(")", "", gsub("^.*\\(", "", df$V1), fixed = TRUE)

                                                              V1     index
1             AMOXICIL/CLAVULAN 875/125 MG CM/CM REC (100000126) 100000126
2             HIDROCLOROTIAZIDA CM (50 MG) CONTENIDO (100028929) 100028929
3 ZIDOVUDINA 10 MG/ML O 50 MG/5 ML SOL ORAL O JARABE (500001802) 500001802
Duck

Some tidyverse options:

xx <- c("AMOXICIL/CLAVULAN 875/125 MG CM/CM REC (100000126)", "HIDROCLOROTIAZIDA CM (50 MG) CONTENIDO (100028929)", "ZIDOVUDINA 10 MG/ML O 50 MG/5 ML SOL ORAL O JARABE (500001802)")


readr::parse_number(stringr::str_sub(xx, -11))        # outputs as numeric
stringr::str_sub(xx, -10, -2)                         # outputs as character
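For comparison, the same positional idea can be sketched in base R with `substr` and `nchar`. Like the `str_sub` calls above, this assumes every string ends with the 9-digit code followed by a closing `)`; the variable names `n` and `codes` are illustrative.

```r
xx <- c("AMOXICIL/CLAVULAN 875/125 MG CM/CM REC (100000126)",
        "HIDROCLOROTIAZIDA CM (50 MG) CONTENIDO (100028929)",
        "ZIDOVUDINA 10 MG/ML O 50 MG/5 ML SOL ORAL O JARABE (500001802)")

# The code occupies the 9 characters just before the closing ")"
n <- nchar(xx)
codes <- substr(xx, n - 9, n - 1)   # character vector
as.numeric(codes)                   # numeric, if needed
```

Because this relies purely on character positions, it would return the wrong substring for entries that lack the trailing parenthesis.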
cephalopod