I'm trying to get an article ID from a webpage. I'm currently using XPath. I don't know anything about regex, but I think regex is the solution.

This is an example of the original code:

<div id="description" class="panel" style="display: block; overflow: hidden;"><h2 class="open">Descripción</h2><div class="productDescription">NEUROBION INYECTABLE X 3 AMPOLLAS<br><br>Modo de Uso: Vía Intramuscular<br>Componente Activo:Vitamina B1 (Tiamina) 100 Mg, Vitamina B6 (Piridoxina) 100 Mg Y Vitamina B12 (Cianocobalamina) 1 Mg. Solucion Inyectable Con Tecnología Doble Camara. <br>INVIMA 2015M-13939-R2<br><img class="" data-src="/arquivos/RX.png?v=636054173313030000" src="/arquivos/RX.png?v=636054173313030000"></div></div>

These are two examples that I got from Screaming Frog using XPath:

<div class="productDescription">NEUROBION 100MG/150MG CAJA X 30 TABLETAS<br><br>Modo de Uso: Vía Oral<br>Componente Activo: Vitamina B1 (Tiamina) Y Vitamina B6 (Piridoxina)<br>INVIMA 2019M-0009578-R1</div>

<div class="productDescription">NEUROBION INYECTABLE X 3 AMPOLLAS<br><br>Modo de Uso: Vía Intramuscular<br>Componente Activo:Vitamina B1 (Tiamina) 100 Mg, Vitamina B6 (Piridoxina) 100 Mg Y Vitamina B12 (Cianocobalamina) 1 Mg. Solucion Inyectable Con Tecnología Doble Camara. <br>INVIMA 2015M-13939-R2<br><img class="" data-src="/arquivos/RX.png?v=636054173313030000" src="/arquivos/RX.png?v=636054173313030000"></div>

But I just want to get this:

INVIMA 2015M-13939-R2
INVIMA 2019M-0009578-R1

This is what I have already done with XPath:

//div[@id="description"]//div

Can somebody help me with the regex?

I also tried this:

["'](INVIMA .*?)["']
Ryszard Czech
Juan Pino
  • You might need to look at [this well-known answer](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Rob Aug 11 '21 at 19:01
  • Using a parser may be a better idea, but if you can't or don't want to, this pattern works for your example: `(?<=<br>)[\w -]+(?=<\/?(?:div|br)>)` See demo: https://regex101.com/r/8QxasW/1 – Alireza Aug 11 '21 at 19:09
  • Is `INVIMA` always in your texts? – Alireza Aug 11 '21 at 19:11
  • If `INVIMA` is always in your texts, this pattern is better: `(?<=INVIMA )[^<]+` See demo: https://regex101.com/r/8QxasW/2 – Alireza Aug 11 '21 at 19:14
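
As a rough illustration of the lookbehind pattern from the last comment, here is a minimal Python sketch. It assumes the HTML is already available as a string. The sample strings are shortened from the question, and the names `SAMPLES` and `registration_numbers` are made up for this example. Screaming Frog's regex engine may not behave exactly like Python's `re` module, so the pattern should still be checked in the tool itself.

```python
import re

# Shortened versions of the two snippets from the question (illustrative only).
SAMPLES = [
    '<div class="productDescription">NEUROBION 100MG/150MG CAJA X 30 TABLETAS'
    '<br>INVIMA 2019M-0009578-R1</div>',
    '<div class="productDescription">NEUROBION INYECTABLE X 3 AMPOLLAS'
    '<br>INVIMA 2015M-13939-R2<br><img class="" src="/arquivos/RX.png"></div>',
]

# (?<=INVIMA ) requires "INVIMA " immediately before the match without
# including it in the result; [^<]+ then consumes everything up to the next tag.
NUMBER_AFTER_INVIMA = re.compile(r"(?<=INVIMA )[^<]+")


def registration_numbers(html):
    """Return the text that follows 'INVIMA ' in each occurrence (hypothetical helper)."""
    return [match.strip() for match in NUMBER_AFTER_INVIMA.findall(html)]


if __name__ == "__main__":
    for sample in SAMPLES:
        print(registration_numbers(sample))
```

Note that this variant returns only the number, without the `INVIMA ` prefix, because a lookbehind is not part of the match.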

1 Answer

Use

INVIMA [A-Z0-9-]+

See regex proof.

EXPLANATION

--------------------------------------------------------------------------------
  INVIMA                   'INVIMA '
--------------------------------------------------------------------------------
  [A-Z0-9-]+               any character of: 'A' to 'Z', '0' to '9',
                           '-' (1 or more times (matching the most
                           amount possible))
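
To see the pattern in action outside the crawler, here is a minimal Python sketch that applies it to strings shortened from the question. The function name `extract_invima_codes` and the sample strings are assumptions for illustration, and Python's `re` module stands in for whatever regex engine the crawler uses.

```python
import re

# Literal "INVIMA " followed by one or more uppercase letters, digits, or hyphens.
INVIMA_CODE = re.compile(r"INVIMA [A-Z0-9-]+")


def extract_invima_codes(html):
    """Return every 'INVIMA <code>' occurrence found in the given HTML string."""
    return INVIMA_CODE.findall(html)


# Shortened versions of the two snippets from the question (illustrative only).
first = (
    '<div class="productDescription">NEUROBION 100MG/150MG CAJA X 30 TABLETAS'
    '<br>INVIMA 2019M-0009578-R1</div>'
)
second = (
    '<div class="productDescription">NEUROBION INYECTABLE X 3 AMPOLLAS'
    '<br>INVIMA 2015M-13939-R2<br><img class="" src="/arquivos/RX.png"></div>'
)

if __name__ == "__main__":
    print(extract_invima_codes(first))
    print(extract_invima_codes(second))
```

Because the character class excludes `<`, the match stops at the following tag, so the trailing `<br>` or `</div>` is not captured.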
Ryszard Czech