I want to extract the plain text from some HTML. I tried a regex:

String target = val.replaceAll("<a.*</a>", "");

My main requirement is to remove everything between <a> and </a>, including the link text. With the code above, all the other content is removed as well.

<a href="www.google.com">Google</a> This is a Google Link

<a href="www.yahoo.com">Yahoo</a> This is a Yahoo Link

Here I want to remove everything from <a> to </a>. The final output should be:

This is a Google Link This is a Yahoo Link

Sathesh S
  • [**THE PONY HE COMES**](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – adeneo Jan 01 '14 at 10:39
  • This is what jQuery is good at. – Rimian Jan 01 '14 at 10:46
  • What's the leading "String" for? Is this Javascript? – ChaseMoskal Jan 01 '14 at 10:55
  • In Javascript, wouldn't this be something more like `var clean_string = my_string.replace(//i,'');` – ChaseMoskal Jan 01 '14 at 11:00

1 Answer

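Use a non-greedy quantifier (*?). The greedy .* matches as much as it can, so it runs from the first <a to the last </a> and swallows the text in between; .*? stops at the nearest </a>. For example, to remove the link entirely: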

String target = val.replaceAll("<a.*?</a>", "");
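To make the difference concrete, here is a minimal sketch that runs both patterns over the question's sample, assuming the two links sit on a single line (by default `.` does not match line terminators in Java). The class and method names are made up for illustration, and the final whitespace cleanup is an extra step added here to reach the asker's exact expected output; it is not part of the one-liner above.

```java
final class GreedyVsLazyDemo {
    // Sample from the question, joined onto one line (an assumption of this sketch).
    static final String HTML =
        "<a href=\"www.google.com\">Google</a> This is a Google Link "
        + "<a href=\"www.yahoo.com\">Yahoo</a> This is a Yahoo Link";

    // Greedy: .* extends to the LAST </a>, so the text between the links is lost.
    static String removeGreedy(String html) {
        return html.replaceAll("<a.*</a>", "");
    }

    // Non-greedy: .*? stops at the NEAREST </a>, so each link is removed separately.
    // The second replaceAll collapses the leftover whitespace (added for this sketch).
    static String removeLazy(String html) {
        return html.replaceAll("<a.*?</a>", "")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println("[" + removeGreedy(HTML) + "]"); // [ This is a Yahoo Link]
        System.out.println("[" + removeLazy(HTML) + "]");   // [This is a Google Link This is a Yahoo Link]
    }
}
```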

Or to replace the link with just the link tag's contents:

String target = val.replaceAll("<a[^>]*>(.*?)</a>", "$1");
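As a small illustration of the capture-group variant, the sketch below keeps the link text and drops only the tags by referring to group 1 as `$1` in the replacement string. The class and method names are hypothetical. Note that `<a[^>]*>` is a loose pattern and would also match other tags whose names start with "a", such as `<abbr>`.

```java
final class LinkTextDemo {
    // [^>]* skips the attributes of the opening tag;
    // (.*?) lazily captures the link text as group 1, which "$1" puts back.
    static String unwrapLinks(String html) {
        return html.replaceAll("<a[^>]*>(.*?)</a>", "$1");
    }

    public static void main(String[] args) {
        String html = "<a href=\"www.google.com\">Google</a> This is a Google Link";
        System.out.println(unwrapLinks(html)); // Google This is a Google Link
    }
}
```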

However, I would still recommend using a proper DOM manipulation API.
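The answer does not name a particular library. As one possible stand-in, the sketch below uses the HTML parser bundled with the JDK (`javax.swing.text.html.parser.ParserDelegator`), which is a callback-style parser rather than a full DOM API; a dedicated HTML library would likely be a better fit in real code. The class name `LinkTextStripper` and the method `stripLinks` are hypothetical, and the way text chunks are joined with single spaces is a simplification chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class LinkTextStripper {
    // Returns the text of the document, skipping anything inside <a>...</a>.
    static String stripLinks(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int anchorDepth = 0; // > 0 while the parser is inside an <a> element

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) anchorDepth++;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && anchorDepth > 0) anchorDepth--;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (anchorDepth > 0) return;            // drop the link text
                String text = new String(data).trim();
                if (text.isEmpty()) return;
                if (out.length() > 0) out.append(' ');  // join chunks with one space
                out.append(text);
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"www.google.com\">Google</a> This is a Google Link "
                    + "<a href=\"www.yahoo.com\">Yahoo</a> This is a Yahoo Link";
        System.out.println(stripLinks(html));
    }
}
```

Because the parser tracks tags itself, this approach does not depend on how the attributes are written or on whether the markup spans several lines, which are the cases where the regex versions tend to break.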

p.s.w.g