String s= "(See <a href=\"/wiki/Grass_fed_beef\" title=\"Grass fed beef\" " +
          "class=\"mw-redirect\">grass fed beef.) They have been used for " +
          "<a href=\"/wiki/Paper\" title=\"Paper\">paper-making since " +
          "2400 BC or before.";

In the string above I have intermixed HTML with text.

The requirement is that the output looks like this:

They have been used for paper-making since 2400 BC or before.

Could someone help me with a generic regular expression that would produce the desired output from the given input?

Thanks in advance!

Simon Nickerson
leba-lev

2 Answers


https://stackoverflow.com/questions/1732348#1732454

You have been warned.

Community
jjnguy
  • I'm sorry, but I am new to this. Could you please tell me what the warning was? I might not have understood. – leba-lev May 27 '10 at 22:02
  • In a less horror-blockbuster tone: he is warning you that regular expressions **should not** be used to parse (X)HTML. – nc3b May 27 '10 at 22:04
  • @rookie Basically, the point is that regular expressions are not good for parsing HTML unless you have a very specific case. You should use an HTML parser tool instead. – jjnguy May 27 '10 at 22:04
  • Yes, I have used the Jericho HtmlParser. But these are specific cases and I can't seem to figure out a good enough regular expression to deal with these cases. The warning comment really left me stumped right there. :). – leba-lev May 27 '10 at 22:07

The following expression:

\([^)]*?\)|<[a-zA-Z/][^>]*?>

will match anything that looks like an HTML tag and any parenthesized text. Replace said text with "", and there ya go.

Note: If you try to match any string that has script tags in it, or "HTML" where the author didn't bother to escape < and > when they weren't used as tag delimiters, or a ( without a ), things will probably not work as you'd hoped.
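
For what it's worth, here is a minimal sketch of how that replacement could look in Java, run against the string from the question. The class name `StripMarkup` and the helper `strip` are made up for illustration, and the `trim()` call is an assumption on my part: it drops the space left behind once the leading parenthetical is removed.

```java
import java.util.regex.Pattern;

class StripMarkup {
    // The expression from this answer: parenthesized text OR anything tag-like.
    private static final Pattern MARKUP =
            Pattern.compile("\\([^)]*?\\)|<[a-zA-Z/][^>]*?>");

    // Hypothetical helper: replace every match with "" and trim the ends.
    static String strip(String input) {
        return MARKUP.matcher(input).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String s = "(See <a href=\"/wiki/Grass_fed_beef\" title=\"Grass fed beef\" " +
                   "class=\"mw-redirect\">grass fed beef.) They have been used for " +
                   "<a href=\"/wiki/Paper\" title=\"Paper\">paper-making since " +
                   "2400 BC or before.";
        // Prints: They have been used for paper-making since 2400 BC or before.
        System.out.println(strip(s));
    }
}
```

A lone `<` that is not followed by a letter or `/` (as in `1 < 2`) is left alone, but the other caveats in the note still apply.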

cHao
  • Thank you very much for your help. I'm sorry for any inconvenience with the way I've framed my question. But I thank you for understanding. I will make sure that I state my objectives better the next time. If it's not too much of a bother, I can't seem to understand how this regular expression does the trick. Would it be possible for you to break it down? If not, that is okay too, I will try to figure it out. Thanks again for your help. – leba-lev May 27 '10 at 22:29
  • It's actually two parts. The first is \([^)]*?\), which will match a (, any number of chars that aren't ) (as few as possible, though -- hence the ?), and then a ). The second part is <[a-zA-Z/][^>]*?>, which will match an opening <, a letter or a / (to try and avoid matching mistakenly unescaped <'s), and everything up to the next > the same way the () part works. The | between them means "or", so if either part matches, the expression matches. – cHao May 27 '10 at 22:45
  • The ?'s can actually be taken out, now that I think about it. It'd never match past the first delimiter, since we're specifying that the delimiter can never be part of the inner string. – cHao May 27 '10 at 22:54