Find a block of descriptive text inside html using regex

Question

I'm trying to figure out how to use a look-ahead to capture the descriptive text in an HTML page such as:

<div class="itemBanner" style="float:left; padding:10px">
<div style="padding-right:5px; padding-bottom:5px">
<div class="itemBanner">
HTML Tags Stripper is designed to strip HTML tags from the text. It will also strip embedded JavaScript code, style information (style sheets), as well as code inside php/asp tags (&lt;?php ?&gt; &lt;%php ?&gt; &lt;% %&gt;). It will also replace sequence of new line characters (multiple) with only one. <b>Allow tags</b> feature is session sticky, i.e. it will remember allowed tags list, so you will have to type them only once.<p></p>You can either provide text in text area below, or enter URL of the web page. If URL provided then HTML Tags Stripper will visit web-page for its contents.<p></p>
<b>Known issues:</b><br />

I figured a regex that looks for a '>' followed by at least 150 characters before a '<' would do the trick.

The closest I've gotten so far is:

(([^.<]){1,500})<

This still fails on things like periods and other characters before and after the string.

Don't use regex to parse HTML: http://stackoverflow.com/a/1732454/2812842 — scrowler, May 22 '14 at 03:49
@scrowler He's not parsing, he's just capturing a block of text. — Kache, May 22 '14 at 03:54
Parsing would be the right way to do it. In this case the HTML looks like XHTML, so you could use an XML parser. — dan-gph, May 22 '14 at 04:11

Kache · Answer 1 · 2014-05-22T04:02:38.833

Your regex matches 1 to 500 characters that are neither "." nor "<", followed by a "<". Because "." is excluded, the match can never span a period.
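To make that concrete, here is a small sketch of the pattern from the question run against a shortened, made-up sample string (the names asker_regex and sample are only for illustration). The first sentence ends in a period, so the match can only begin after it:

```ruby
# The pattern from the question: 1 to 500 characters that are
# neither "." nor "<", followed by a literal "<".
asker_regex = /(([^.<]){1,500})</

# Hypothetical sample, shortened from the HTML in the question.
sample = '<div class="itemBanner">First sentence. Second sentence, no period<b>bold</b>'

match = asker_regex.match(sample)
captured = match && match[1]

# Everything up to and including the first "." is skipped, because no run
# starting there can reach a "<" without crossing the period.
puts captured.inspect
```

The capture starts right after the period, so the first sentence is lost entirely.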

Assuming you want to capture everything from the itemBanner div until the very next occurrence of a closing div, you can use these elements:

<div class="itemBanner"> - explicit match
() - parenthetical wrap for referencing, e.g. match[1]
.*? - any length of characters, non-greedily (as few as possible)
<\/div> - explicit match, with escaped '/'

to form this Ruby regex:

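# The m flag lets "." match newlines, so the capture can span several lines.
item_banner_div_regex = /<div class="itemBanner">(.*?)<\/div>/m
match = item_banner_div_regex.match(html)
# match is nil when nothing matches, so guard before reading the capture group.
inside_item_banner_div = match && match[1]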

Note: The exact regex will depend on the implementation you're using.
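If you'd rather stick with the heuristic from the question (a '>' followed by at least 150 characters before the next '<'), a rough Ruby sketch could look like the following. The helper name first_long_text_run, its min_length parameter, and the sample HTML are all made up for illustration. Note that inline tags such as <b> end the run, so this only returns the text up to the first nested tag.

```ruby
# Hypothetical helper: returns the first run of at least min_length
# characters sitting between a ">" and the next "<", or nil if none exists.
def first_long_text_run(html, min_length = 150)
  # [^<] also matches newlines, so no m flag is needed here.
  pattern = />([^<]{#{min_length},})</
  match = pattern.match(html)
  match && match[1].strip
end

# Made-up sample: one sentence repeated so the text run exceeds 150 characters.
description = "HTML Tags Stripper is designed to strip HTML tags from the text. " * 3
html = %(<div class="itemBanner">\n#{description}<b>Allow tags</b> feature</div>)

puts first_long_text_run(html)
```

Short fragments like "Allow tags" are skipped because they fall below the length threshold, which is the filtering effect the question was after.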
