
I am trying to parse the following HTML fragment:

  <p class=3DMsoNormal style=3D'mso-layout-grid-align:none'><span style=3D'font-size:
8.0pt'>$</span><span lang=3DEN-US style=3D'font-size:8.0pt;mso-ansi-language:
  EN-US'>ogrnNomer</span><span style=3D'font-size:8.0pt'>$</span><span
  style=3D'font-size:8.0pt;background:yellow;mso-highlight:yellow;mso-fareast-language:
  RU'><o:p></o:p></span></p>

with this regular expression:

(<((?!<).)*(:\n)((?!<).)*>)*(((?!<).)*)(<\/(((?!<).)*)>)*

I expected to get 4 matches, each with 5 groups.

For example, for the match

<span style=3D'font-size:
8.0pt'>$</span>

I expected the following groups:

Group 1:

<span style=3D'font-size

Group 2:

: (followed by a line break)

Group 3:

8.0pt'>

Group 4:

$

Group 5:

</span>

But I can't achieve this with my regular expression.

What should my regular expression look like?

My example is available here: https://regex101.com/r/5LA8J0/1

Arthur
  • [Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239). HTML and regex are not good friends. Use a parser, it is simpler, faster and much more maintainable. – Toto Mar 03 '21 at 14:40
  • Try `(<(\w+)[^>]*?)(:\n?)([^>]*>)([^<>]*)(<\/\2>)`, see [this regex demo](https://regex101.com/r/5LA8J0/2). Group 2 is a "technical" group to match the tag name. – Wiktor Stribiżew Mar 03 '21 at 16:04
  • Oh, thank you, but this task was solved about a month ago ))) – Arthur Apr 23 '21 at 15:00
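
For anyone who wants to try the pattern from Wiktor Stribiżew's comment outside regex101, here is a minimal Python sketch. It assumes Python's `re` module treats this pattern the same way as the flavor used in the demo, which seems reasonable because the pattern only uses character classes, lazy quantifiers and a backreference. The names `HTML`, `PATTERN` and `find_elements` are made up for this illustration. Note that group 2 holds the tag name, so the five groups described in the question end up as groups 1, 3, 4, 5 and 6.

```python
import re

# The fragment from the question, with its line breaks written as "\n".
HTML = (
    "<p class=3DMsoNormal style=3D'mso-layout-grid-align:none'><span style=3D'font-size:\n"
    "8.0pt'>$</span><span lang=3DEN-US style=3D'font-size:8.0pt;mso-ansi-language:\n"
    "  EN-US'>ogrnNomer</span><span style=3D'font-size:8.0pt'>$</span><span\n"
    "  style=3D'font-size:8.0pt;background:yellow;mso-highlight:yellow;mso-fareast-language:\n"
    "  RU'><o:p></o:p></span></p>"
)

# Pattern taken from the comment; group 2 is the "technical" tag-name group
# that the backreference \2 reuses to find the matching closing tag.
PATTERN = re.compile(r"(<(\w+)[^>]*?)(:\n?)([^>]*>)([^<>]*)(<\/\2>)")


def find_elements(text):
    """Return one tuple of six groups per element the pattern matches."""
    return [m.groups() for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    for number, groups in enumerate(find_elements(HTML), start=1):
        print(f"match {number}")
        for index, value in enumerate(groups, start=1):
            # repr() makes the line break inside a group visible as \n
            print(f"  group {index}: {value!r}")
```

Because group 5 is `[^<>]*`, an element whose content is another tag (such as the last `<span>`, which wraps `<o:p></o:p>`) should not be matched by this pattern. Handling nested elements like that is where the parser suggested in the first comment would likely be the more robust choice.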
