I am using regular expressions for the first time, in Scala, to search for images and scripts in webpages. The expressions I've come up with are:

Images:

/(<img\S+\s+\/>)+/

Scripts:

/(<script\s+\S+><\/script>)+/

I don't really know anything about HTML or regex, so I'm not sure how to specify that it should match <img .../>, where the ... could be any number of characters or whitespace. This is a small part of a programming assignment I'm writing in Scala, and we have to use regex.

possum_pendulum

1 Answer

A regex like <img[^>]*> would match <img..........>.

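A regex like <script.*?</script> would match a single <script...>...</script> instance. The ? makes the .* lazy; without it, the match would run from the first <script...> tag to the last </script> tag. Note that . does not match line breaks by default, so a script that spans several lines also needs the (?s) flag, as in (?s)<script.*?</script>.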
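As a rough sketch of how these two patterns might be used from Scala: the object name TagFinder and its method names are made up for illustration, and the (?s) flag is an addition so that . also matches line breaks inside a script.

```scala
import scala.util.matching.Regex

// Hypothetical helper; the names are illustrative, not from the assignment.
object TagFinder {
  // "<img", then any run of characters other than '>', then the closing '>'.
  val ImgTag: Regex = """<img[^>]*>""".r

  // (?s) lets '.' match line breaks; the lazy .*? stops at the first </script>.
  val ScriptTag: Regex = """(?s)<script.*?</script>""".r

  // Each method returns every non-overlapping match, in document order.
  def findImages(html: String): List[String] = ImgTag.findAllIn(html).toList
  def findScripts(html: String): List[String] = ScriptTag.findAllIn(html).toList
}
```

Triple-quoted strings avoid having to escape backslashes or quotes in the pattern. Calling TagFinder.findImages(page) on a page's source returns the matched tags as plain strings.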

(Feel free to add back in the capturing ( )'s, the \ escapes, and surround with the regex delimiting / / tokens. I removed them to focus on the regular expressions themselves, without the leaning toothpick syndrome and other noise.)

While these are better than the ones you proposed, they will still break in many circumstances, because regular expressions are not designed to parse HTML. For example, the script pattern is fooled by input like this:

<script>
  <!-- This "</script>" doesn't end the script, but fools the RegEx -->
</script>
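To see the regex side of that failure, here is a small sketch that runs the script pattern over the snippet above. The object name FooledRegex is hypothetical, and the (?s) flag is assumed so that the pattern can cross line breaks.

```scala
// Hypothetical demo object; only the regex behaviour is being illustrated.
object FooledRegex {
  val ScriptTag = """(?s)<script.*?</script>""".r

  // The snippet from above, written with explicit \n line breaks.
  val html: String =
    "<script>\n  <!-- This \"</script>\" doesn't end the script, but fools the RegEx -->\n</script>"

  // The lazy match ends at the first "</script>" it meets, which is the one
  // inside the comment, so the result is a truncated fragment.
  val firstMatch: Option[String] = ScriptTag.findFirstIn(html)
}
```

The match covers only the text up to the quoted </script> in the comment line, not the whole element as written.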
AJNeufeld