Scraping HTML with Regex

Question

I can't use any PHP code as the Regex is for a script I purchased (there is just a text box I have to enter the regex into)...

I'm trying to use Regex to scrape contents between the anchors "<h2>Highlights</h2>" & "</div><div class="FloatClear"></div><div id="SalesMarquee">" within the HTML segment below:

But when I tried this regex, it returned nothing...

<h2\b[^>]*>.*?<\/h2>[(&nbsp;)\t\s]*(.*?)[(&nbsp;)\t\s]*<\/div>

I think it may have something to do with the empty spaces within the HTML source...

Can any Regex gurus give me the magic expression for grabbing everything between any given HTML anchors, like the ones mentioned above (one that can also cope with any empty spaces within the HTML source)?

Many thanks

HTML segment

<div id="Highlights">

      <h2>Highlights</h2>

      <ul>

<li>1234</li>

<li>abc def asdasd asdasd</li>

<li>asdasda as asdasdasdas </li>

<li>asdasd asdasdas asdsad asdasd asa</li>

</ul>





     </div>

     <div class="FloatClear"></div>

     <div id="SalesMarquee">

      <div id="SalesMarqueeTemplate" style="display: none;">

Aren't you better off just using a DOM parser for this? Or is there a reason for wanting to RegEx it? – Exelian Jan 24 '11 at 12:19
I MUST use regex, because I don't have a choice! I'm using an off-the-shelf script which only gives me a textbox to enter the Regex into... – user587064 Jan 24 '11 at 12:22
You would have better luck asking how to change the PHP script you paid for to do the task without regex. – thirtydot Jan 24 '11 at 12:56

score 0 · Answer 1 · edited May 23 '17 at 12:15

Don't use regex to scrape HTML.

See here for compelling reasons why.

Use an HTML parser instead - this SO answer suggests using DOMDocument->loadHTML().
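For comparison, here is a rough sketch of the parser approach. It uses Python's standard `html.parser` module as a stand-in for PHP's DOMDocument, so it illustrates the idea rather than the linked answer's code. The class name `HighlightsParser` and the extra "not a highlight" item are made up for the illustration; the rest of the HTML is a shortened copy of the segment in the question.

```python
from html.parser import HTMLParser


class HighlightsParser(HTMLParser):
    """Collects the text of each <li> inside <div id="Highlights">."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # <div> nesting depth inside the target div (0 = outside it)
        self.in_item = False  # True while inside an <li> of the target div
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif dict(attrs).get("id") == "Highlights":
                self.depth = 1
        elif tag == "li" and self.depth > 0:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


# Shortened segment from the question; the last <div> content is hypothetical.
html = """<div id="Highlights">
  <h2>Highlights</h2>
  <ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee"><ul><li>not a highlight</li></ul></div>"""

parser = HighlightsParser()
parser.feed(html)
highlights = [item.strip() for item in parser.items]
print(highlights)  # ['1234', 'abc def asdasd asdasd']
```

Whitespace and blank lines between tags do not affect the result here, which is the main practical advantage over a pattern-based approach.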

answered Jan 24 '11 at 12:20

Oded

I MUST use regex, because I don't have a choice! I'm using an off-the-shelf script which only gives me a textbox to enter the Regex into... – user587064 Jan 24 '11 at 12:22

score 0 · Accepted Answer · answered Jan 24 '11 at 14:25

In this case, because it's so simple, I think you might be able to pull it off with Regex. Although you could probably construct an example where it fails, it should work in all normal cases. I suppose that in this kind of code a failure wouldn't exactly be a security risk.

It's not working because of the dot you use in the middle of the expression. By default, the dot matches any character EXCEPT a newline, and the content you want to capture spans several lines. To test, I used [\W\w] instead, which does work (a stupid hack to really match anything).
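A small sketch of that behaviour, using Python's `re` module as a stand-in. The assumption is that the purchased script uses PHP's PCRE functions, which treat the dot the same way by default. The `html` string is a shortened copy of the segment in the question, and `dot_pattern` is the asker's expression unchanged.

```python
import re

# Shortened copy of the HTML segment from the question.
html = """<div id="Highlights">
      <h2>Highlights</h2>
      <ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
</ul>
     </div>
     <div class="FloatClear"></div>"""

# The asker's expression: the ".*?" inside the group cannot cross a line break.
dot_pattern = r"<h2\b[^>]*>.*?<\/h2>[(&nbsp;)\t\s]*(.*?)[(&nbsp;)\t\s]*<\/div>"
print(re.search(dot_pattern, html))  # None

# Same expression with the dot in the group replaced by [\W\w],
# which matches every character, newlines included.
hack_pattern = r"<h2\b[^>]*>.*?<\/h2>[(&nbsp;)\t\s]*([\W\w]*?)[(&nbsp;)\t\s]*<\/div>"
match = re.search(hack_pattern, html)
print(match.group(1))
```

With the second pattern, group 1 holds the whole `<ul>...</ul>` block, line breaks and all.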

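The clean way is to switch your regex into single-line mode using the s switch. How to do that depends on your framework, but usually it's /<regex>/s. If you can only enter the pattern itself, PCRE-style engines also accept an inline (?s) at the start of the expression.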

See http://www.regular-expressions.info/dot.html for more info.
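As a rough illustration of single-line mode, the sketch below again uses Python's `re` module as a stand-in. Python calls the mode `re.DOTALL`, and it also accepts an inline `(?s)` at the start of the pattern. Whether the purchased script's textbox accepts the inline form is an assumption that would need testing. The pattern is a simplified version of the asker's, with `\s*` in place of the `&nbsp;` character classes.

```python
import re

html = "<h2>Highlights</h2>\n<ul>\n<li>1234</li>\n</ul>\n</div>"

# Simplified version of the asker's expression.
pattern = r"<h2\b[^>]*>.*?</h2>\s*(.*?)\s*</div>"

# Without single-line mode the dot stops at the first line break.
print(re.search(pattern, html))  # None

# Flag supplied by the host language (comparable to /<regex>/s elsewhere).
with_flag = re.search(pattern, html, re.DOTALL)

# Inline modifier: handy when only the pattern text can be edited.
with_inline = re.search("(?s)" + pattern, html)

print(with_flag.group(1) == with_inline.group(1))  # True
print(repr(with_flag.group(1)))
```

Both variants capture the same multi-line content, so the inline form is worth trying first when the surrounding code cannot be changed.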

Joeri Hendrickx

Thanks Joeri Hendrickx, finally someone answered my question :) BTW, I found another way to match the blank spaces using "[( )\t\s]*". But for some reason the capture group doesn't seem to work. I can see in Rubular that I'm matching the right bits, but I want to capture EVERYTHING surrounded by " "
– user587064 Jan 24 '11 at 22:34
If this answer helps you, please upvote and accept it. If you need more info, I suggest you edit your question or ask another one. Anyway, if you want to capture everything _between_ , you'll need to add those to the expression (together with all the junk to get rid of attributes) and then put a pair of parens around the part between them to group it. Also, add `?:` right after an opening paren to keep that one from acting as a capturing group. That way you can get to a point where group 1 is the one you want, which is probably what your script needs. Group 0 is always the entire match.
– Joeri Hendrickx Jan 25 '11 at 09:29
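The grouping advice can be sketched as follows, again with Python's `re` as a stand-in for the script's engine. The delimiters the commenter had in mind did not survive in the text, so this example reuses the question's `<h2>` and `</div>` anchors. The `class="title"` attribute is hypothetical and only there to exercise the attribute-skipping part.

```python
import re

html = """<div id="Highlights">
  <h2 class="title">Highlights</h2>
  <ul>
<li>1234</li>
</ul>
</div>"""

# (?: ... ) groups without capturing, so the attribute junk gets no group number
# and the single pair of plain parentheses becomes group 1.
pattern = r"(?s)<h2\b(?:\s[^>]*)?>.*?</h2>\s*(.*?)\s*</div>"

match = re.search(pattern, html)
print(len(match.groups()))  # 1
print(match.group(0))       # entire match, from <h2 ...> through </div>
print(match.group(1))       # only the content between the anchors
```

If the script always reads group 1, keeping every other pair of parentheses non-capturing avoids having to renumber when the expression grows.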
