I have to parse a string like this:

foo <img ... > <strong>foo</strong> bar

and I need to replace the img tag with an empty string, so that the result is:

foo <strong>foo</strong> bar

I've tried this pattern:

<img.*>

but the result is

foo bar

How can I do this?

PS: the HTML string is malformed.

1 Answer

To match the taste of SO, this answer has three parts:

* Answer to your problem
* Official rant
* Cleaner solution

Answer to the problem

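The * quantifier is greedy, so .* matches too much: it runs to the last > in the string, which here is the end of </strong>. Two solutions are possible:

1.) <img.*?> uses the non-greedy quantifier *?, which stops at the first > it finds
2.) <[^>]+> matches everything within a single pair of angle brackets; restricted to the img tag, that is <img[^>]*>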
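
To make the difference concrete, here is a minimal sketch in Python (the question names no language, so Python and the sample string are assumptions; the same patterns should behave the same way in most regex engines). It only illustrates greedy versus non-greedy matching and is not a robust way to handle HTML.

```python
import re

# Hypothetical sample input, modelled on the string in the question.
html = "foo <img src='a.png' alt='x'> <strong>foo</strong> bar"

# Greedy: .* runs to the LAST '>' in the string, so <strong>...</strong> is lost too.
greedy = re.sub(r"<img.*>", "", html)

# Non-greedy: .*? stops at the FIRST '>' after <img.
lazy = re.sub(r"<img.*?>", "", html)

# Negated character class: [^>]* can never cross a '>'.
negated = re.sub(r"<img[^>]*>", "", html)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

Both fixes leave the whitespace on either side of the removed tag in place, and both will still break if an attribute value contains a > character.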

Rant

Never parse HTML using regex. There are many subtle errors you can run into. See also this post on the topic: RegEx match open tags except XHTML self-contained tags

Cleaner solution

Parse the string with an XML parser backed by TagSoup (https://hackage.haskell.org/package/tagsoup), which tolerates malformed HTML. Here is an example that lets you treat HTML as an XML-like structure with Scala and TagSoup: https://github.com/daandi/spOCR/blob/master/src/main/scala/biz/neumann.parser/HTMLParser.scala
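
The same idea can be sketched without TagSoup. The example below swaps in Python's standard-library html.parser, which is also lenient about malformed markup; it is not the TagSoup approach linked above, and the names ImgStripper and strip_img are made up for illustration. It walks the input and re-emits everything except img tags.

```python
from html.parser import HTMLParser


class ImgStripper(HTMLParser):
    """Re-emit the input markup while dropping <img> tags (hypothetical helper).

    Comments, doctype declarations and processing instructions are not
    re-emitted by this sketch.
    """

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            # Original source text of the start tag, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form, e.g. <br/> or <img ... />.
        if tag != "img":
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "img":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_img(html):
    parser = ImgStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_img("foo <img src='a.png' alt='x'> <strong>foo</strong> bar"))
```

Because the parser decides where a tag ends, this does not depend on guessing the right regex for every odd attribute value. End tags are rebuilt from the lowercased tag name, so their original casing is not preserved.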

Andreas Neumann