Regex to get all content after the first closing HTML tag in Java

Question

I need a regex that returns all the content after the first closing ">" that occurs before "<a href", through to the end of the string.

How do I get that? I'm not good at regex :/

eg:

<img class="abc" src="abc.jpg"> blah blah blah&nbsp;<a 
href="http://en.wikipedia.org/wiki">abc defg hijk lmnop</a>&nbsp; blah

Expected output:

blah blah blah abc defg hijk lmnop blah

http://blog.codinghorror.com/parsing-html-the-cthulhu-way/ – Thomas, Apr 01 '14 at 16:52

score 0 · Answer 1 · answered Apr 01 '14 at 16:54

Try this one:

htmls = htmls.replaceAll(".*?>(?=.*?<a href)", "");

It removes everything up to and including the closing ">" that comes before the first "<a href".
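A minimal sketch of how the pattern above behaves on the sample from the question. It assumes the sample is on a single line with one space between `<a` and `href`, since `.` does not match line breaks by default and the pattern looks for the literal text `<a href`. The class name `AfterFirstTag` and the helper `toPlainText` are illustrative only. The second helper is an assumption about what the asker wants, based on the expected output, which has the remaining tags and `&nbsp;` removed.

```java
class AfterFirstTag {
    // The pattern from this answer: lazily match up to a '>' that still
    // has "<a href" somewhere after it, and remove that match.
    public static String dropLeadingTag(String html) {
        return html.replaceAll(".*?>(?=.*?<a href)", "");
    }

    // Assumption (not part of the answer): strip the remaining tags and
    // the &nbsp; entity to reach the plain text shown in the question.
    public static String toPlainText(String html) {
        return html.replaceAll("<[^>]*>", "")  // drop any remaining tags
                   .replace("&nbsp;", " ")    // only handles this one entity
                   .replaceAll("\\s+", " ")   // collapse runs of whitespace
                   .trim();
    }

    public static void main(String[] args) {
        String htmls = "<img class=\"abc\" src=\"abc.jpg\"> blah blah blah&nbsp;"
                + "<a href=\"http://en.wikipedia.org/wiki\">abc defg hijk lmnop</a>&nbsp; blah";

        String rest = dropLeadingTag(htmls);
        System.out.println(rest);              // the <img ...> tag is gone, the link markup remains
        System.out.println(toPlainText(rest)); // prints "blah blah blah abc defg hijk lmnop blah"
    }
}
```

Note that `replaceAll` on its own leaves the `<a href=...>` markup and the `&nbsp;` entities in place, so a second step of some kind is needed to get the expected output.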

Sabuj Hassan

score 0 · Answer 2 · edited May 23 '17 at 12:05

Long story short, you cannot reliably parse HTML with a regex because HTML is not a regular language. See here for a full discussion.
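As a small illustration of the fragility (not a proof of the point about regular languages), here is a sketch that runs the pattern from the other answer on two inputs that differ only slightly from the question's sample. The class name `RegexFragility` and both input strings are made up for the example.

```java
class RegexFragility {
    // Same pattern as in the other answer.
    public static String dropLeadingTag(String html) {
        return html.replaceAll(".*?>(?=.*?<a href)", "");
    }

    public static void main(String[] args) {
        // Two links: replaceAll keeps matching as long as another
        // "<a href" follows, so the text " one " and the first link are lost.
        String twoLinks = "<img src=\"a.jpg\"> one <a href=\"u\">x</a> two <a href=\"v\">y</a>";
        System.out.println(dropLeadingTag(twoLinks)); // prints " two <a href="v">y</a>"

        // A line break between "<a" and "href": the literal "<a href"
        // never occurs, so nothing is removed at all.
        String wrapped = "<img src=\"a.jpg\"> text <a\nhref=\"u\">x</a>";
        System.out.println(dropLeadingTag(wrapped).equals(wrapped)); // prints "true"
    }
}
```

Both inputs are valid HTML that a browser would render without complaint, which is why a dedicated HTML parser (jsoup is one commonly used Java library) tends to be the safer choice for anything beyond a fixed, known input.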

Community

answered Apr 01 '14 at 16:57

tylerstacey
