Regex fails with html snippet

Question

I need to extract the content of an HTML tag using RegEx. The body of text I'm searching looks like this:

<div class="content">
    The Price is <script type="text/javascript">document.write(123())</script>
</div>

I tried the expression below, but it fails to match. I need to extract "document.write(123())".

(?s)<div class="content">[^<]*<script type="text/javascript">(.*?)</script></div>

How can I modify my expression to get what I'm after?

Because you shouldn't do it like that. http://stackoverflow.com/q/1732348/1015495 — Mike G, Mar 05 '13 at 20:01
The most common reason for a regex to fail is because it's a wrong tool for the job :) — Sergey Kalinichenko, Mar 05 '13 at 20:01
You actually have requirements that **require** you to use a regex? — jahroy, Mar 05 '13 at 20:02
As I once heard: If you had a problem and you are solving it with regular expressions, you now have two problems. :) — Chris Cooper, Mar 05 '13 at 20:03
I have a requirement to do it with regex. Otherwise I would happily use Jsoup — Kathick, Mar 05 '13 at 20:04
"I have the requirement that need to do it in regex". You mean it's like an exercise or assignment in a regex class? I ask that because, otherwise, there's no reason to really use a regex for that. — Filipe Fedalto, Mar 05 '13 at 20:06
Try an inverted character class `([^\<]+)` instead of matching all non-newline characters `(.*?)`, but pray that your JavaScript doesn't use a `<` character :-) — emallove, Mar 05 '13 at 20:08

score 1 · Accepted Answer · answered Mar 05 '13 at 20:07

There are a couple of problems with your regular expression:

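The (?s) prefix is an inline flag that makes . match line breaks; not every regex flavor supports it, so drop it if yours does not.
You are not accounting for the whitespace between </script> and </div>.
The forward slashes (/) need to be escaped as \/ if your flavor uses / as the pattern delimiter.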

This seems to work (DEMO):

<div class="content">[^<]*<script type="text\/javascript">(.*?)<\/script>[^<]*<\/div>
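A minimal Java sketch of the difference, assuming `java.util.regex` (the question does not state the regex flavor; the Jsoup mention only suggests Java). The class name `ScriptExtractor` and the helper `extract` are hypothetical names chosen for illustration. It runs the original pattern and the adjusted one against the snippet from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptExtractor {
    // The snippet from the question, including the line breaks inside the div.
    static final String HTML =
        "<div class=\"content\">\n"
        + "    The Price is <script type=\"text/javascript\">document.write(123())</script>\n"
        + "</div>";

    // Original pattern: nothing is allowed between </script> and </div>.
    static final Pattern ORIGINAL = Pattern.compile(
        "(?s)<div class=\"content\">[^<]*<script type=\"text/javascript\">(.*?)</script></div>");

    // Adjusted pattern: the second [^<]* absorbs the newline before </div>.
    // Escaping / as \/ is harmless in Java, where / has no special meaning.
    static final Pattern ADJUSTED = Pattern.compile(
        "<div class=\"content\">[^<]*<script type=\"text\\/javascript\">(.*?)<\\/script>[^<]*<\\/div>");

    // Returns the first capture group, or null when the pattern does not match.
    static String extract(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(ORIGINAL, HTML)); // no match
        System.out.println(extract(ADJUSTED, HTML)); // the script body
    }
}
```

In this sketch the original pattern finds nothing because of the newline before `</div>`, while the adjusted pattern captures the script body.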

score 1 · Answer 2 · answered Mar 05 '13 at 20:08

You just forgot to account for the whitespace between </script> and </div>:

(?s)<div class="content">[^<]*<script type="text/javascript">(.*?)</script>\s*</div>
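As a rough check of this pattern, here is a short Java sketch, again assuming `java.util.regex` since the thread does not name the flavor. The class `WhitespaceTolerantMatch`, the helper `scriptBody`, and the multi-line variant of the snippet are hypothetical and not taken from the question. The variant illustrates what `(?s)` contributes: it lets `(.*?)` cross line breaks inside the script body.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceTolerantMatch {
    // Pattern from the answer above: \s* allows whitespace before </div>,
    // and (?s) lets . match line terminators inside the script body.
    static final Pattern PATTERN = Pattern.compile(
        "(?s)<div class=\"content\">[^<]*<script type=\"text/javascript\">(.*?)</script>\\s*</div>");

    // The snippet from the question.
    static final String FROM_QUESTION =
        "<div class=\"content\">\n"
        + "    The Price is <script type=\"text/javascript\">document.write(123())</script>\n"
        + "</div>";

    // Hypothetical variant: the script body spans several lines.
    static final String MULTI_LINE =
        "<div class=\"content\">\n"
        + "    The Price is <script type=\"text/javascript\">\ndocument.write(123())\n</script>\n"
        + "</div>";

    // Returns the captured script body, or null when there is no match.
    static String scriptBody(String html) {
        Matcher m = PATTERN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(scriptBody(FROM_QUESTION));
        // trim() drops the line breaks captured around the multi-line body.
        System.out.println(scriptBody(MULTI_LINE).trim());
    }
}
```

Note that `\s*` is stricter than `[^<]*`: it tolerates only whitespace between the closing tags, so stray text after `</script>` would make the match fail.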

nicopico

score 1 · Answer 3 · edited May 23 '17 at 12:27

Extracting content from HTML using regex is a sure road to madness. It's worse than the idea of validating email addresses with regex.

If you are using C#/.NET, I can recommend the Html Agility Pack, which does an awesome job of extracting content from any HTML (there is a good answer here on Stack Overflow that shows how to use it).

If you are using some other technology, just look for alternative libraries that do the same thing. You are sure to find that somebody else has already solved this problem.
