2

This post asks how to extract a string between two other strings in R: "Extracting a string between other two strings in R"

I'm looking for a similar answer, but one that covers multiple occurrences between the patterns.

Example string:

Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:

Between each occurrence of the words "Fabricante" and "CNPJ", there is a company name, which I would like to extract. In this string, there are three such companies: "EMS S/A", "EMS S/A", and "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA".

Based on the post above, this code

gsub(".*Fabricante: *(.+) CNPJ:.*", "\\1", df$manufacturing_location[92])

returns the last occurrence, "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA".

When I change to

gsub(".*Fabricante: *(.*?) CNPJ:.*", "\\1", df$manufacturing_location[92])

it returns the first. I tried changing to \\2, as I thought this would number the occurrences, but then I get an empty string. I also tried using stringr's str_match_all, but that did not work either.

Does anyone know how to adjust the syntax so I can tailor the code to return each of the three as needed?

I would like to put this into a mutate call that I can apply to a dataset with many such strings, returning the first, second, and third entries as variables. So far, I have not been able to make str_match_all work for this.

jay.sf

3 Answers

2

We can use str_match_all as follows:

library(stringr)  # provides str_match_all()

x <- "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:"
matches <- str_match_all(x, "(?<=\\bFabricante:  ).*?(?= CNPJ:)")[[1]]
matches

     [,1]                                               
[1,] "EMS S/A"                                          
[2,] "EMS S/A"                                          
[3,] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"

Here is an explanation of the regex pattern being used:

  • (?<=\\bFabricante:  ) is a lookbehind asserting that "Fabricante:" followed by two spaces immediately precedes the match
  • .*? then lazily matches as little content as possible, stopping at the nearest position where
  • (?= CNPJ:) the lookahead can assert that " CNPJ:" follows
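To apply the same pattern to a whole column and get the first, second, and third names as separate variables, something along these lines might work. This is only a sketch in base R (gregexpr with perl = TRUE plus regmatches) rather than stringr; `extract_between()` and the `fabricante_` column names are hypothetical, and the input rows are shortened, ASCII-only excerpts of the example string.

```r
# Hypothetical helper (not part of any package): returns a character matrix
# with one row per input string and one column per match, padded with NA.
extract_between <- function(text, pattern = "(?<=\\bFabricante:  ).*?(?= CNPJ:)") {
  # gregexpr() finds every match; regmatches() returns a list with one
  # character vector of matches per input string.
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
  width <- max(lengths(hits), 1L)
  # Pad shorter results so that every row has the same number of columns.
  padded <- lapply(hits, function(h) c(h, rep(NA_character_, width - length(h))))
  out <- do.call(rbind, padded)
  colnames(out) <- paste0("fabricante_", seq_len(width))
  out
}

# Shortened excerpts of the example string (assumption: same keyword layout).
rows <- c(
  "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65",
  "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65"
)
res <- extract_between(rows)
print(res)
```

Because the result has one row per input string, a single column such as `extract_between(df$manufacturing_location)[, 1]` should be usable on the right-hand side of a mutate call, and `cbind(df, extract_between(df$manufacturing_location))` is another option; both assume a non-empty input column.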
Tim Biegeleisen
  • Thank you @Tim Biegeleisen! This might be a stretch, but would you mind telling me what changed in the Regex expression to make it work? – Joao Francisco Pugliese Mar 01 '23 at 06:35
  • One issue with this solution is that when I use it inside a "mutate" verb, `str_match_all` does not perform this row by row; one option would be to `lapply` this function over a column, I guess. – Joao Francisco Pugliese Mar 01 '23 at 06:51
  • @JoaoFranciscoPugliese I have added an explanation of the regex pattern. The problem with your `sub` approach is that it's the wrong function for getting all matches. – Tim Biegeleisen Mar 01 '23 at 06:54
0

You could strsplit at the keywords and subset to the desired elements.

el(strsplit(x, '\\s?\\w*:\\s+'))[c(2, 6, 10)]
# [1] "EMS S/A"                                           "EMS S/A"                                          
# [3] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"
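The hard-coded indices c(2, 6, 10) appear to rely on the string holding exactly three records with the same four keywords each. When the number of records varies, one possible variant is to split on "Fabricante:" first and then cut each record at "CNPJ:". The sketch below assumes a single input string in which every record contains both keywords; `split_manufacturers()` is a hypothetical name, and the sample string is a shortened excerpt of the example.

```r
# Hypothetical helper: one element per "Fabricante: ... CNPJ:" record.
split_manufacturers <- function(text) {
  # Split the single string into records at each "Fabricante:" keyword.
  records <- strsplit(text, "Fabricante:\\s*")[[1]]
  # Drop the empty piece that precedes the first keyword.
  records <- records[nzchar(records)]
  # Keep only the text in front of "CNPJ:" in each record.
  sub("\\s*CNPJ:.*$", "", records)
}

# Shortened excerpt of the example string (assumption: same keyword layout).
s <- "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65"
names_found <- split_manufacturers(s)
print(names_found)
```

A record without "CNPJ:" would be returned unchanged by this sketch, so it is worth checking the data for that case first.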
jay.sf
0

Your data appears to be in the Debian control file (DCF) format. You could read it with read.dcf in base R after adding line breaks, and then access whichever column you want.

read.dcf(textConnection(gsub("(Fabricante)","\n\\1",gsub(" (\\S+:)", "\n\\1", x))),all = TRUE)
                                         Fabricante                 CNPJ                                     Endereço Fabricaçao
1                                           EMS S/A - 57.507.378/0001-01 SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de           
2                                           EMS S/A - 57.507.378/0003-65           HORTOLANDIA - SP - BRASIL Etapa de           
3 NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA - 12.424.020/0001-79                MANAUS - AM - BRASIL Etapa de 

--- breakdown:

gsub(" *(\\S+:)", "\n\\1", x) |>          # put every "key:" at the start of a new line
  gsub("(Fabricante)", "\n\\1", x = _) |> # add a blank line before each record (the `_` placeholder needs R >= 4.2)
  textConnection() |>                     # expose the text as a file-like connection
  read.dcf(all = TRUE)                    # parse the records into a data frame
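
Once the result is stored, the manufacturer names are just one column of it. A minimal sketch, assuming a shortened, ASCII-only excerpt of the example string that keeps only the Fabricante and CNPJ fields; `dcf_text`, `con`, and `res` are hypothetical names:

```r
# Shortened excerpt of the example string (assumption: only two fields per record).
x_short <- "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01 Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65"

# Same two substitutions as in the breakdown: one "key: value" per line,
# and a blank line in front of every record.
dcf_text <- gsub("(Fabricante)", "\n\\1", gsub(" *(\\S+:)", "\n\\1", x_short))

con <- textConnection(dcf_text)
res <- read.dcf(con, all = TRUE)  # data frame with one row per record
close(con)                        # release the connection explicitly

print(res$Fabricante)             # the company names
```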

    
Onyambu