Regular expressions in Java

Question

String s = "(See <a href=\"/wiki/Grass_fed_beef\" title=\"Grass fed beef\" " +
          "class=\"mw-redirect\">grass fed beef.) They have been used for " +
          "<a href=\"/wiki/Paper\" title=\"Paper\">paper-making since " +
          "2400 BC or before.";

The string above mixes HTML with plain text.

The required output is:

They have been used for paper-making since 2400 BC or before.

Could someone help me with a generic regular expression that produces this output from the given input?

Thanks in advance!

score 1 · Answer 1 · edited May 23 '17 at 12:11


https://stackoverflow.com/questions/1732348#1732454

You have been warned.

edited May 23 '17 at 12:11 by Community

answered May 27 '10 at 21:57 by jjnguy

I'm sorry, but I am new to this. Could you please tell me what the warning was? I may not have understood. – leba-lev May 27 '10 at 22:02
2

In a less horror-blockbuster tone: he is warning you that regular expressions **should not** be used to parse (X)HTML. – nc3b May 27 '10 at 22:04
@rookie Basically, the point is that regular expressions are not good for parsing HTML unless you have a very specific case. You should use an HTML parser instead. – jjnguy May 27 '10 at 22:04
Yes, I have used the Jericho HtmlParser. But these are specific cases and I can't seem to figure out a good enough regular expression to deal with these cases. The warning comment really left me stumped right there. :). – leba-lev May 27 '10 at 22:07

cHao · Accepted Answer · 2010-05-27T22:19:17.843

1

The following expression:

\([^)]*?\)|<[a-zA-Z/][^>]*?>

will match anything that looks like an HTML tag and any parenthesized text. Replace each match with "", and there ya go.

Note: If you try to match any string that has script tags in it, or "HTML" where the author didn't bother to escape < and > when they weren't used as tag delimiters, or a ( without a ), things will probably not work as you'd hoped.
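
As a minimal sketch of the replacement step, the expression can be applied to the string from the question with `Pattern` and `Matcher.replaceAll`. The class name `StripMarkup` and the helper `strip` are hypothetical names chosen for illustration, and the final `trim()` is an assumption added here to drop the space left behind once the parenthesized text is removed.

```java
import java.util.regex.Pattern;

class StripMarkup {
    // The expression from the answer: parenthesized text OR anything tag-like.
    private static final Pattern MARKUP =
            Pattern.compile("\\([^)]*?\\)|<[a-zA-Z/][^>]*?>");

    /** Removes every match, then trims whitespace left at either end (hypothetical helper). */
    public static String strip(String input) {
        return MARKUP.matcher(input).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String s = "(See <a href=\"/wiki/Grass_fed_beef\" title=\"Grass fed beef\" "
                + "class=\"mw-redirect\">grass fed beef.) They have been used for "
                + "<a href=\"/wiki/Paper\" title=\"Paper\">paper-making since "
                + "2400 BC or before.";
        System.out.println(strip(s));
        // prints: They have been used for paper-making since 2400 BC or before.
    }
}
```

The caveats in the note still apply: this only behaves well on input shaped like the sample, where parentheses are balanced and every < starts a real tag.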

edited May 27 '10 at 22:19

answered May 27 '10 at 22:09

cHao


Thank you very much for your help. I'm sorry for any inconvenience with the way I've framed my question, and I thank you for understanding. I will make sure that I state my objectives better next time. If it's not too much of a bother: I can't seem to understand how this regular expression does the trick. Would it be possible for you to break it down? If not, that is okay too; I will try to figure it out. Thanks again for your help. – leba-lev May 27 '10 at 22:29
1

It's actually two parts. The first is \([^)]*?\), which will match a (, any number of chars that aren't ) (as few as possible, though -- hence the ?), and then a ). The second part is <[a-zA-Z/][^>]*?>, which will match an opening <, then a letter or a / (to try to avoid matching mistakenly unescaped <'s), and then everything up to the next >, the same way the () part works. The | between them means "or", so if either part matches, the expression matches. – cHao May 27 '10 at 22:45
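
To see the two parts separately, here is a small sketch that runs each alternative on its own and then the combined expression. The class `RegexParts`, the helper `matches`, and the sample input are hypothetical and made up for illustration; they are not from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexParts {
    /** Collects every substring of input that the given regex matches (hypothetical helper). */
    public static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "(aside) keep <b>bold</b> and 1 < 2";

        // First alternative: '(' , chars that are not ')' , then ')'.
        System.out.println(matches("\\([^)]*?\\)", input));
        // prints: [(aside)]

        // Second alternative: '<', a letter or '/', then everything up to '>'.
        // The bare '<' in "1 < 2" is followed by a space, so it is skipped.
        System.out.println(matches("<[a-zA-Z/][^>]*?>", input));
        // prints: [<b>, </b>]

        // Joined with '|': a match of either alternative counts.
        System.out.println(matches("\\([^)]*?\\)|<[a-zA-Z/][^>]*?>", input));
        // prints: [(aside), <b>, </b>]
    }
}
```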
1

The ?'s can actually be taken out, now that I think about it. It'd never match past the first closing delimiter, since [^)] and [^>] already specify that the delimiter can never be part of the inner string. – cHao May 27 '10 at 22:54
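
A quick way to check this claim is to compare the expression with and without the ?'s on the same input. The sketch below uses a hypothetical class `LazyVsGreedy` and a made-up sample string; the last pattern, `\(.*\)`, is not from the thread and is included only as a contrast, to show what happens when the inner part is allowed to contain the closing delimiter.

```java
import java.util.regex.Pattern;

class LazyVsGreedy {
    // The answer's expression, with the reluctant quantifiers ...
    public static final Pattern LAZY =
            Pattern.compile("\\([^)]*?\\)|<[a-zA-Z/][^>]*?>");
    // ... and the same expression with the ?'s removed.
    public static final Pattern GREEDY =
            Pattern.compile("\\([^)]*\\)|<[a-zA-Z/][^>]*>");

    /** Replaces every match of p in input with the empty string (hypothetical helper). */
    public static String strip(Pattern p, String input) {
        return p.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "(one) two (three) <i>four</i>";

        // The negated character classes stop at the first ')' or '>' either way.
        System.out.println(strip(LAZY, input).equals(strip(GREEDY, input)));
        // prints: true

        // Contrast: a greedy '.' can run past the first ')' and swallow " two ".
        Pattern greedyDot = Pattern.compile("\\(.*\\)");
        System.out.println("[" + strip(greedyDot, input) + "]");
        // prints: [ <i>four</i>]
    }
}
```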
