Suppose I have

<img class="size-full wp-image-10225" alt="animals" src="abc.jpg"> blah blah blah&nbsp;
<a href="http://en.wikipedia.org/wiki/Elephant">elephant is an animal</a>&nbsp;blah

I want a regex that gives me this output:

blah blah blah <a href="http://en.wikipedia.org/wiki/Elephant">elephant is an animal</a> blah

without the &nbsp; entities. I can do str.replace("&nbsp;","") separately, but how do I get the string from the first "blah blah blah" through the final "blah", including the link tag?

ajp15243
user3298846
  • You must remove the `img` tag separately. Do you only need the `a` tag? That works with a regex. If you also want the text before and after the tag, you will run into problems. Why not simply remove the unneeded tags? – Adrian Preuss Mar 28 '14 at 19:14
  • I do need the text before the tag as well. So I can't just call StringUtils.removeHTMLTags(), because that removes all the tags and I want to keep the link tag. What I'm thinking is to locate the first ">" before the a href and then capture the text from there till ( inclusive) – user3298846 Mar 28 '14 at 19:17
  • _Sees regex and HTML in title_ "http://stackoverflow.com/a/1732454/2846923." – The Guy with The Hat Mar 28 '14 at 19:27

1 Answer

Maybe something like this?

^<[^>]*>\s*|&nbsp;

Java escaped:

^<[^>]*>\\s*|&nbsp;

regex101 demo

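^<[^>]*>\\s* matches a tag at the very start of the string (here, the img tag) and any whitespace that follows it. The second alternative, &nbsp;, matches every non-breaking-space entity. Replace all matches with the empty string "".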
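As a rough sketch of how that replacement might look in Java: the class name `StripLeadingTag` and the helper `clean` are made up for illustration, and the input is the sample from the question with the line break written as `\n`. It assumes the unwanted tag is the very first thing in the string.

```java
import java.util.regex.Pattern;

class StripLeadingTag {
    // Either one tag at the very start of the input plus trailing whitespace,
    // or a literal &nbsp; entity anywhere in the input.
    private static final Pattern UNWANTED =
            Pattern.compile("^<[^>]*>\\s*|&nbsp;");

    // Hypothetical helper: deletes every match of the pattern above.
    static String clean(String html) {
        return UNWANTED.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String input =
                "<img class=\"size-full wp-image-10225\" alt=\"animals\" src=\"abc.jpg\"> blah blah blah&nbsp;\n"
                + "<a href=\"http://en.wikipedia.org/wiki/Elephant\">elephant is an animal</a>&nbsp;blah";
        System.out.println(clean(input));
    }
}
```

Because each &nbsp; is replaced with nothing, the text after the closing link tag ends up directly adjacent to it. If a plain space is wanted there instead, replacing &nbsp; with " " in a separate pass is one option.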

You might want to use a proper HTML parser though, since it'll be less likely to break.

Jerry