Regex - extract substrings that start and end with specific patterns (HTML parsing)?

Question

I am parsing some well-organized strings (HTML format) to extract data. The format looks like this (newlines added for reading convenience):

<span><h2>Category 1</h2>
<p><strong><u>Entry 1</u></strong></p>
<ul><li>Some Data</li></ul>
<h2>Category 2</h2>
<p><strong><u>Entry 2</span>
<ul><li>Some Data</li></ul>
</span>

I intend to find all strings between <h2> tags and first extract the text that follows </h2>. The search pattern is /<h2>Tier.*?<\/h2>(.*?)(<h2>|<\/span>)/g. But each match ends exactly with <h2>, so that tag is consumed and the next category is not extracted, while the third category block is fine because a new search starts there.

Then I tried to greedily match substrings that do not contain <h2>. The pattern is h2>Category.*?<\/h2>(^(h2).)*. It does not work, though.
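
The behaviour described above can be reproduced with a short Python sketch. The three-category string is a made-up, well-formed sample (not the real data), and the literal Category is used in place of Tier so that the pattern matches it. The lookahead variant is only an assumption about what might avoid consuming the next <h2>, not a confirmed fix.

```python
import re

# Hypothetical, well-formed sample with three categories (not the real data).
html = (
    "<span>"
    "<h2>Category 1</h2><ul><li>A</li></ul>"
    "<h2>Category 2</h2><ul><li>B</li></ul>"
    "<h2>Category 3</h2><ul><li>C</li></ul>"
    "</span>"
)

# Pattern from the question: the trailing <h2> is part of the match, so the
# next search starts after it and the second heading can no longer match.
consuming = re.compile(r"<h2>Category.*?</h2>(.*?)(?:<h2>|</span>)")
consumed_blocks = consuming.findall(html)

# Same pattern with a lookahead: the terminator is checked but not consumed,
# so the next search can start at the following <h2>.
lookahead = re.compile(r"<h2>Category.*?</h2>(.*?)(?=<h2>|</span>)")
lookahead_blocks = lookahead.findall(html)

print(consumed_blocks)   # ['<ul><li>A</li></ul>', '<ul><li>C</li></ul>']
print(lookahead_blocks)  # ['<ul><li>A</li></ul>', '<ul><li>B</li></ul>', '<ul><li>C</li></ul>']
```

If the real input contains newlines, the re.DOTALL flag would probably be needed so that . also matches line breaks.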

Where is "Tier" text in your HTML code??? – Envy Jun 24 '19 at 09:23
Copied wrong strings, edited. – Varg Nord Jun 24 '19 at 09:36
Use beautifulsoup. – Casimir et Hippolyte Jun 24 '19 at 10:23

score 1 · Answer 1 · answered Jun 24 '19 at 09:33

Try extracting it with this regex:

<h2>\K[^<]+

Here is a demo.
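
For readers trying this in Python: the standard re module does not support \K, so the sketch below swaps in a lookbehind, (?<=<h2>), which should give the same result for this input. The html string is the sample from the question, joined onto one line and copied as posted.

```python
import re

# Sample from the question, joined onto one line (copied as posted).
html = (
    "<span><h2>Category 1</h2>"
    "<p><strong><u>Entry 1</u></strong></p>"
    "<ul><li>Some Data</li></ul>"
    "<h2>Category 2</h2>"
    "<p><strong><u>Entry 2</span>"
    "<ul><li>Some Data</li></ul>"
    "</span>"
)

# (?<=<h2>) asserts that "<h2>" precedes the match without including it,
# standing in for the PCRE-only \K; [^<]+ then takes the heading text.
names = re.findall(r"(?<=<h2>)[^<]+", html)

print(names)  # ['Category 1', 'Category 2']
```

Note that this only returns the heading text, not the data that follows each heading.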

Good Luck!

score 0 · Answer 2 · answered Jun 24 '19 at 09:59

Your question is not clear, and it confused me.

But I think you want this:

<h2>[^<]+<\/h2>(.+?<\/ul>)

Demo: https://regex101.com/r/k16AoN/2
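
A minimal Python sketch of how this pattern might be applied, assuming the sample from the question joined onto one line (copied as posted). The re.DOTALL flag is an addition not present in the regex above; it only matters if the real input contains newlines.

```python
import re

# Sample from the question, joined onto one line (copied as posted).
html = (
    "<span><h2>Category 1</h2>"
    "<p><strong><u>Entry 1</u></strong></p>"
    "<ul><li>Some Data</li></ul>"
    "<h2>Category 2</h2>"
    "<p><strong><u>Entry 2</span>"
    "<ul><li>Some Data</li></ul>"
    "</span>"
)

# Group 1 lazily captures everything after </h2> up to the first </ul>.
# re.DOTALL lets "." cross line breaks if the input has any.
pattern = re.compile(r"<h2>[^<]+</h2>(.+?</ul>)", re.DOTALL)
blocks = pattern.findall(html)

for block in blocks:
    print(block)
```

Because the match stops at </ul> and never consumes the next <h2>, each category is found in turn.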

Besides that, you should read this: What is the difference between HTML tags <div> and <span>? Maybe you are using the <span> tag incorrectly.
