-3

I am parsing some well-organized strings (HTML format) to extract data. The format looks like this (newlines added for readability):

<span><h2>Category 1</h2>
<p><strong><u>Entry 1</u></strong></p>
<ul><li>Some Data</li></ul>
<h2>Category 2</h2>
<p><strong><u>Entry 2</u></strong></p>
<ul><li>Some Data</li></ul>
</span>

I intend to find every <h2> heading and extract the text that follows its </h2>, up to the next heading. The search pattern is /<h2>Tier.*?<\/h2>(.*?)(<h2>|<\/span>)/g. But each match ends by consuming the next <h2>, so the category that starts there is not extracted, while the third category block is found because a new search begins after it.
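To illustrate the skipped match, here is a minimal Python sketch. The `html` sample is made up from the format above, and it uses "Category" in place of "Tier" (both are assumptions, not the real data). It compares the consuming pattern with a variant that checks the delimiter in a lookahead, so the next <h2> stays available for the following match:

```python
import re

# Hypothetical sample modeled on the question's format (single line, no newlines).
html = (
    "<span><h2>Category 1</h2>"
    "<p><strong><u>Entry 1</u></strong></p>"
    "<ul><li>Some Data</li></ul>"
    "<h2>Category 2</h2>"
    "<p><strong><u>Entry 2</u></strong></p>"
    "<ul><li>Other Data</li></ul>"
    "</span>"
)

# Consuming version: the last group swallows the next <h2>,
# so the search resumes after it and misses that category.
consuming = re.compile(r"<h2>Category.*?</h2>(.*?)(<h2>|</span>)")

# Lookahead version: the delimiter is only checked, not consumed.
lookahead = re.compile(r"<h2>Category.*?</h2>(.*?)(?=<h2>|</span>)")

consumed_blocks = [m.group(1) for m in consuming.finditer(html)]
all_blocks = [m.group(1) for m in lookahead.finditer(html)]

print(len(consumed_blocks), len(all_blocks))
```

The same `(?=...)` lookahead syntax should also work in the JavaScript-style `/.../g` pattern.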

Then I tried to search greedily for substrings that do not contain <h2>. The pattern is h2>Category.*?<\/h2>(^(h2).)*. It does not work, though.
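For reference, `^` inside a group is a start-of-string anchor, not a negation, which is probably why the second pattern fails. "Any character that does not start <h2>" is usually written with a negative lookahead, `(?:(?!<h2>).)*`. A rough Python sketch of that idea follows; the `html` sample and the extra `</span>` stop condition are assumptions based on the format above:

```python
import re

# Hypothetical sample modeled on the question's format.
html = (
    "<span><h2>Category 1</h2>"
    "<p><strong><u>Entry 1</u></strong></p>"
    "<ul><li>Some Data</li></ul>"
    "<h2>Category 2</h2>"
    "<p><strong><u>Entry 2</u></strong></p>"
    "<ul><li>Other Data</li></ul>"
    "</span>"
)

# (?:(?!<h2>|</span>).)* advances one character at a time, but only while
# the text at the current position does not start with <h2> or </span>.
pattern = re.compile(
    r"<h2>(Category.*?)</h2>((?:(?!<h2>|</span>).)*)",
    re.DOTALL,  # let "." cross newlines in case the real data has them
)

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(html)]

for heading, body in pairs:
    print(heading, "->", body)
```

Because the stop markers are never consumed, each heading is still available as the start of the next match.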

Varg Nord

2 Answers

1

Try extracting the headings with this regex:

<h2>\K[^<]+

Here Is Demo
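Note that `\K` is a PCRE/Perl feature and is not available in every engine; Python's built-in `re` module, for example, rejects it. A small sketch of two equivalents, assuming a made-up `html` sample in the question's format: a capturing group, or a fixed-width lookbehind.

```python
import re

# Hypothetical sample modeled on the question's format.
html = (
    "<span><h2>Category 1</h2>"
    "<p><strong><u>Entry 1</u></strong></p>"
    "<ul><li>Some Data</li></ul>"
    "<h2>Category 2</h2>"
    "<p><strong><u>Entry 2</u></strong></p>"
    "<ul><li>Other Data</li></ul>"
    "</span>"
)

# Capturing group: findall returns only the group contents.
headings = re.findall(r"<h2>([^<]+)", html)

# Lookbehind: <h2> must precede the match but is not part of it.
headings_lookbehind = re.findall(r"(?<=<h2>)[^<]+", html)

print(headings)
print(headings_lookbehind)
```

Either form returns just the heading text, not the data that follows each heading.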

Good Luck!

0

Your question is not clear, and it confused me.

But I think you want this:

<h2>[^<]+<\/h2>(.+?<\/ul>)

Demo: https://regex101.com/r/k16AoN/2
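As a rough illustration of how this pattern could be applied, here is a Python sketch. The `html` sample is made up from the question's format, and the extra capturing group around the heading text is my addition, not part of the pattern above:

```python
import re

# Hypothetical sample modeled on the question's format.
html = (
    "<span><h2>Category 1</h2>"
    "<p><strong><u>Entry 1</u></strong></p>"
    "<ul><li>Some Data</li></ul>"
    "<h2>Category 2</h2>"
    "<p><strong><u>Entry 2</u></strong></p>"
    "<ul><li>Other Data</li></ul>"
    "</span>"
)

# Group 1: heading text (added here for convenience).
# Group 2: everything up to and including the first </ul> after the heading.
pattern = re.compile(r"<h2>([^<]+)</h2>(.+?</ul>)", re.DOTALL)

blocks = dict(pattern.findall(html))

for heading, body in blocks.items():
    print(heading, "->", body)
```

This assumes every category contains exactly one <ul>; with several lists under one heading, only the first would be captured.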

Besides that, you should refer to this: What is the difference between HTML tags <div> and <span>?. Maybe you are using the wrong tag with <span>.

Envy