I have this regex to extract paragraphs that are outside of a table:

((?<=<\/w:tbl>)<w:p [^>]*>.*?<\/w:p>(?=<w:tbl>)|(?<=<\/w:tbl>)<w:p [^>]*>.*?<\/w:p>(?=<w:sectPr .*>))

The problem is that it treats all the paragraphs as one paragraph: the match runs from the first opening tag to the last closing tag and swallows the paragraphs in between. Below is an example of the text. In this case it returns one match instead of 3.

</w:tr></w:tbl><w:p w:rsidR="00F24C60" w:rsidRDefault="00F24C60" w:rsidP="009D46A1"><w:pPr><w:spacing w:before="240" w:after="240"/></w:pPr><w:r><w:t></w:t></w:r></w:p><w:p w:rsidR="00F24C60" w:rsidRDefault="00F24C60" w:rsidP="009D46A1"><w:pPr><w:spacing w:before="240" w:after="240"/></w:pPr><w:r><w:t></w:t></w:r></w:p><w:p w:rsidR="00346D4D" w:rsidRPr="00AC7B53" w:rsidRDefault="00F24C60" w:rsidP="009D46A1"><w:pPr><w:spacing w:before="240" w:after="240"/></w:pPr><w:r><w:t></w:t></w:r></w:p><w:tbl><w:tblPr>

How can I make it match each paragraph separately (3 matches)?

Thanks.

SAliaMunch
  • Your sample string only has one `</w:tbl>` tag, so your lookbehind only matches before the first paragraph. Your pattern then ends with a lookahead for the next `<w:tbl>` tag, which comes after 3 paragraphs, so it groups the 3 together. Try removing the lookbehind and lookahead and you should get your 3 groups separately: `(<w:p [^>]*>.*?<\/w:p>|<w:p [^>]*>.*?<\/w:p>(?=<w:sectPr .*>))` – dvo Sep 27 '19 at 15:35
  • This is a sample string; the complete file contains many tables. Also, if the lookahead and lookbehind are removed, the regex will match paragraphs inside the tables, which must not be matched. – SAliaMunch Sep 27 '19 at 15:39
  • In that case I would capture what you are capturing now (all paragraphs outside of tables) and then use another regex and C# to split those three apart. – dvo Sep 27 '19 at 15:42

1 Answer

I don't think you can do this with a single regex. You want to match groups that sit inside another delimited region, but a regex knows nothing about structure; it just scans the string from beginning to end. Take the string `eabcabce`: if I need all `abc` groups I can use `(abc)`, but I can't say that I only want the `abc` groups that lie between the two `e` characters.

You can use an XML parser instead.
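As a rough illustration of the parser route, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. The tiny `DOC` string is a made-up stand-in for a real `document.xml`, and the sketch assumes that the paragraphs you want are direct children of `w:body` (paragraphs inside tables are nested under `w:tc`, so they are skipped). It returns every top-level paragraph, not only those between two tables, so it may need adjusting for your file.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical miniature document, used only for illustration.
DOC = f'''<w:document xmlns:w="{W}"><w:body>
<w:tbl><w:tr><w:tc><w:p><w:r><w:t>in table</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
<w:p><w:r><w:t>first</w:t></w:r></w:p>
<w:p><w:r><w:t>second</w:t></w:r></w:p>
<w:tbl><w:tr><w:tc><w:p><w:r><w:t>in table too</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
<w:p><w:r><w:t>third</w:t></w:r></w:p>
<w:sectPr/>
</w:body></w:document>'''

root = ET.fromstring(DOC)
body = root.find(f"{{{W}}}body")

# Keep only w:p elements that are direct children of w:body;
# paragraphs nested inside w:tbl are never direct children.
outside = [child for child in body if child.tag == f"{{{W}}}p"]

# Collect the text of each paragraph separately.
texts = ["".join(p.itertext()) for p in outside]
print(texts)  # ['first', 'second', 'third']
```

Because the parser tracks nesting, each paragraph comes back as its own element and no lookahead or lookbehind is needed.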

You can try two regexes for this particular case:

  1. Get the content between the tbl tags with your regex.
  2. Get the groups from that content with this regex: (<w:p [^>]*>.*?<\/w:p>)
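The two steps above could be sketched in Python roughly as follows. The names `OUTER`, `INNER`, and `paragraphs_outside_tables` are made up for this example, the sample string is a trimmed version of the one in the question, and the outer capturing group and the `\/` escapes are dropped because Python does not need them. Treat it as a sketch under those assumptions, not a tested solution for full documents.

```python
import re

# Trimmed version of the sample from the question (attributes shortened).
SAMPLE = (
    '</w:tr></w:tbl>'
    '<w:p w:rsidR="00F24C60"><w:pPr><w:spacing w:before="240"/></w:pPr>'
    '<w:r><w:t></w:t></w:r></w:p>'
    '<w:p w:rsidR="00F24C60"><w:r><w:t></w:t></w:r></w:p>'
    '<w:p w:rsidR="00346D4D"><w:r><w:t></w:t></w:r></w:p>'
    '<w:tbl><w:tblPr>'
)

# Step 1: the regex from the question; it grabs the whole run of
# paragraphs that follows a table as a single match.
OUTER = re.compile(
    r'(?<=</w:tbl>)<w:p [^>]*>.*?</w:p>(?=<w:tbl>)'
    r'|(?<=</w:tbl>)<w:p [^>]*>.*?</w:p>(?=<w:sectPr .*>)',
    re.DOTALL,
)

# Step 2: split one run into its individual paragraphs.
INNER = re.compile(r'<w:p [^>]*>.*?</w:p>', re.DOTALL)


def paragraphs_outside_tables(xml):
    """Return each paragraph found in the runs matched by OUTER."""
    result = []
    for run in OUTER.finditer(xml):
        result.extend(INNER.findall(run.group(0)))
    return result


paragraphs = paragraphs_outside_tables(SAMPLE)
print(len(paragraphs))  # 3
```

The second pattern is safe to apply only to the text already matched by the first one; run on the whole file, it would also pick up paragraphs inside tables.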

Some links:

  1. Why not to parse HTML with regex (I think your XML is close to HTML :)): RegEx match open tags except XHTML self-contained tags
  2. https://www.regextester.com/
Nikita