I need to extract the content of an HTML tag using RegEx. The body of text I'm searching looks like this:

<div class="content">
    The Price is <script type="text/javascript">document.write(123())</script>
</div>

I tried the expression below, but it fails to match. I need to extract the "document.write(123())" part.

(?s)<div class="content">[^<]*<script type="text/javascript">(.*?)</script></div>

How can I modify my expression to get what I'm after?

Blumer
Kathick
  • Because you shouldn't do it like that. http://stackoverflow.com/q/1732348/1015495 – Mike G Mar 05 '13 at 20:01
  • The most common reason for a regex to fail is that it's the wrong tool for the job :) – Sergey Kalinichenko Mar 05 '13 at 20:01
  • You actually have requirements that **require** you to use a regex? – jahroy Mar 05 '13 at 20:02
  • As I once heard: If you had a problem and you are solving it with regular expressions, you now have two problems. :) – Chris Cooper Mar 05 '13 at 20:03
  • I have the requirement that need to do it in regex. Else I would use Jsoup happily – Kathick Mar 05 '13 at 20:04
  • "I have the requirement that need to do it in regex". You mean it's like an exercise or assignment in a regex class? I ask because, otherwise, there's no real reason to use a regex for that. – Filipe Fedalto Mar 05 '13 at 20:06
  • Try a negated character class `([^\<]+)` instead of the lazy match-anything group `(.*?)`, but pray that your JavaScript doesn't use a `<` character :-) – emallove Mar 05 '13 at 20:08

3 Answers

There are a couple of problems with your regular expression:

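  • (?s) is the inline "dot matches newline" (DOTALL) flag. Not every regex flavor accepts it (JavaScript, for example, does not), so drop it if your engine rejects it.
  • You are not accounting for the whitespace between </script> and </div>.
  • The forward slashes (/) need to be escaped as \/ only where / is also the regex delimiter, as in a JavaScript regex literal. Elsewhere the escape is harmless but unnecessary.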

This seems to work (DEMO):

<div class="content">[^<]*<script type="text\/javascript">(.*?)<\/script>[^<]*<\/div>
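Since the asker mentions Jsoup, here is a minimal sketch of how that pattern behaves under java.util.regex, assuming Java is the target. The class name DotAllDemo, the extract helper, and the multi-line sample are made up for illustration. It shows that \/ is accepted as a plain / in Java, and that the DOTALL flag (the equivalent of a leading (?s)) only matters once the script body spans more than one line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
final class DotAllDemo {
    // The snippet from the question.
    static final String SINGLE_LINE =
        "<div class=\"content\">\n"
        + "    The Price is <script type=\"text/javascript\">document.write(123())</script>\n"
        + "</div>";

    // Invented variant whose script body contains a line break.
    static final String MULTI_LINE =
        "<div class=\"content\">\n"
        + "    The Price is <script type=\"text/javascript\">document.write(\n123())</script>\n"
        + "</div>";

    // The pattern from the answer. Java treats \/ as a literal slash.
    static final String PATTERN =
        "<div class=\"content\">[^<]*<script type=\"text\\/javascript\">(.*?)<\\/script>[^<]*<\\/div>";

    // Returns capture group 1, or null when the pattern does not match.
    static String extract(String html, boolean dotAll) {
        Pattern p = Pattern.compile(PATTERN, dotAll ? Pattern.DOTALL : 0);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract(SINGLE_LINE, false)); // document.write(123())
        System.out.println(extract(MULTI_LINE, false));  // null: '.' stops at the line break
        System.out.println(extract(MULTI_LINE, true));   // matches across the line break
    }
}
```

For the exact snippet in the question the flag makes no difference, because the script body sits on a single line.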
mellamokb

You just forgot to account for the whitespace between </script> and </div>:

(?s)<div class="content">[^<]*<script type="text/javascript">(.*?)</script>\s*</div>
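To see the difference, here is a small Java sketch (assuming java.util.regex; the class WhitespaceFixDemo and the helper firstGroup are names invented for this example) that runs the expression from the question and the one above against the sample text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two expressions.
final class WhitespaceFixDemo {
    // The snippet from the question, including the line break before </div>.
    static final String HTML =
        "<div class=\"content\">\n"
        + "    The Price is <script type=\"text/javascript\">document.write(123())</script>\n"
        + "</div>";

    // Expression from the question: expects </div> immediately after </script>.
    static final String ORIGINAL =
        "(?s)<div class=\"content\">[^<]*<script type=\"text/javascript\">(.*?)</script></div>";

    // Expression from this answer: \s* absorbs the line break and any indentation.
    static final String FIXED =
        "(?s)<div class=\"content\">[^<]*<script type=\"text/javascript\">(.*?)</script>\\s*</div>";

    // Returns capture group 1, or null when the regex does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstGroup(ORIGINAL, HTML)); // null
        System.out.println(firstGroup(FIXED, HTML));    // document.write(123())
    }
}
```

The original fails only because of the newline between the two closing tags; the rest of the expression is fine for this input.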

nicopico

Extracting content from HTML using Regex is a sure road to madness. It's worse than the idea of validating email addresses with Regex.

If you are using C#/.NET I can recommend Html Agility Pack, which does an awesome job of extracting content from any HTML (there is a good answer here on StackOverflow that shows how to use it).

If you are using some other technology, just look for alternative libraries that do the same thing. You are sure to find that somebody else has already solved this problem.
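As a rough illustration of the parser route in Java, here is a sketch that uses only the JDK's built-in XML parser as a stand-in. This is an assumption made to keep the example dependency-free: it works only because the snippet in the question happens to be well-formed XML, which real-world HTML usually is not, so a dedicated HTML parser such as the Jsoup library mentioned in the comments would be the realistic choice. The class ParserSketch and the method firstScriptBody are invented names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch: treats the fragment as XML, which is an assumption.
final class ParserSketch {
    // Returns the text of the first <script> element, or null if there is none.
    static String firstScriptBody(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            NodeList scripts = doc.getElementsByTagName("script");
            return scripts.getLength() > 0 ? scripts.item(0).getTextContent() : null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Thrown for input that is not well-formed XML, i.e. most real HTML.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html =
            "<div class=\"content\">\n"
            + "    The Price is <script type=\"text/javascript\">document.write(123())</script>\n"
            + "</div>";
        System.out.println(firstScriptBody(html)); // document.write(123())
    }
}
```

Unlike the regex, this does not care about whitespace between the closing tags or about attribute order.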

Community
nikib3ro