I am parsing some well-structured strings (in HTML format) to extract data. The format looks like this (newlines added for readability):
<span><h2>Category 1</h2>
<p><strong><u>Entry 1</u></strong></p>
<ul><li>Some Data</li></ul>
<h2>Category 2</h2>
<p><strong><u>Entry 2</u></strong></p>
<ul><li>Some Data</li></ul>
</span>
My plan is to first find each <h2> heading and extract the text that follows its closing </h2>. The search pattern is:
/<h2>Category.*?<\/h2>(.*?)(<h2>|<\/span>)/g
However, every match ends with (and therefore consumes) the <h2> that opens the next category. As a result, the next category is not extracted, while the third category block is matched fine because a new search starts there.
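The skipping behaviour can be reproduced with a short sketch. It assumes JavaScript (the /.../g literal suggests it, but that is a guess), uses "Category" headings as in the sample, joins the sample onto one line, closes the Entry 2 line the same way as Entry 1, and appends a hypothetical Category 3 block. The extra capture group around the heading text and the names html, consuming, and lookahead are illustrative additions. The second pattern replaces the trailing group with a lookahead, (?=...), which checks for the terminator without consuming it. That is one possible way around the problem, not necessarily the best one.

```javascript
// Sample from the question on a single line, plus a hypothetical Category 3 block.
const html =
  "<span><h2>Category 1</h2>" +
  "<p><strong><u>Entry 1</u></strong></p>" +
  "<ul><li>Some Data</li></ul>" +
  "<h2>Category 2</h2>" +
  "<p><strong><u>Entry 2</u></strong></p>" +
  "<ul><li>Some Data</li></ul>" +
  "<h2>Category 3</h2>" +
  "<p><strong><u>Entry 3</u></strong></p>" +
  "<ul><li>Some Data</li></ul>" +
  "</span>";

// Pattern from the question (heading text captured for display only):
// the trailing group consumes the <h2> that opens the next category.
const consuming = /<h2>(Category.*?)<\/h2>(.*?)(<h2>|<\/span>)/g;
const consumingTitles = [...html.matchAll(consuming)].map((m) => m[1]);

// Same pattern with a lookahead: the terminator is checked, not consumed,
// so the next search can start at the following <h2>.
const lookahead = /<h2>(Category.*?)<\/h2>(.*?)(?=<h2>|<\/span>)/g;
const lookaheadTitles = [...html.matchAll(lookahead)].map((m) => m[1]);

console.log(consumingTitles); // [ 'Category 1', 'Category 3' ]
console.log(lookaheadTitles); // [ 'Category 1', 'Category 2', 'Category 3' ]
```

In both cases m[2] holds the text between the heading and the terminator. Note that . does not match newlines by default, so the sketch relies on the input being a single line.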
Then I tried to greedily match a substring that does not contain <h2>. The pattern is:
h2>Category.*?<\/h2>(^(h2).)*
It does not work, though.
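One likely reason the second attempt fails: outside a character class, ^ is a start-of-input anchor, not a negation, so (^(h2).)* can only match zero times after the heading. A common way to say "any character, as long as <h2> does not start here" is a negative lookahead applied before each character, (?:(?!<h2>).)*. Below is a minimal sketch of that idea, again assuming JavaScript and the sample joined onto one line with the Entry 2 line closed like Entry 1. The names html, pattern, and blocks are illustrative, and </span> is added to the lookahead so the last block stops before the closing tag.

```javascript
// Sample from the question on a single line.
const html =
  "<span><h2>Category 1</h2>" +
  "<p><strong><u>Entry 1</u></strong></p>" +
  "<ul><li>Some Data</li></ul>" +
  "<h2>Category 2</h2>" +
  "<p><strong><u>Entry 2</u></strong></p>" +
  "<ul><li>Some Data</li></ul>" +
  "</span>";

// (?:(?!<h2>|<\/span>).)* consumes one character at a time, but only while
// neither <h2> nor </span> begins at the current position.
const pattern = /<h2>(Category.*?)<\/h2>((?:(?!<h2>|<\/span>).)*)/g;

// Collect the heading text and the block that follows it.
const blocks = [...html.matchAll(pattern)].map((m) => ({
  title: m[1],
  body: m[2],
}));

console.log(blocks.map((b) => b.title)); // [ 'Category 1', 'Category 2' ]
console.log(blocks[1].body);
// <p><strong><u>Entry 2</u></strong></p><ul><li>Some Data</li></ul>
```

Because the repeated group never steps onto an <h2>, the match stops right before the next heading and nothing is consumed from the following category.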