I am using a regex to remove HTML tags, with something like result.replaceAll("<.*?>", "");

However, it does not get rid of the img tags in the HTML. What is a good way to do that?

Suchi
  • For the love of all things kind and sane do not Regex HTML - Use a parser, please. – zellio Jun 14 '11 at 18:13
  • Can you please explain in more detail? I'd think that something like `` should be removed with your regex. – Howard Jun 14 '11 at 18:13
  • What happens if the html contains something like `x < 7`? You can't parse/process HTML with a regex. See this answer for why: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 – Marc B Jun 14 '11 at 18:14
  • @Marc B, that isn't a valid example, since you would need to escape the < using &lt; – Mikola Jun 14 '11 at 18:23
  • @Mimisbrunnr: I agree that *parsing* HTML with regexes is impossible (by the pumping lemma), but he is just doing a lexical analysis which is a regular language (ie just separate out the tokens). There is no reason why regexes would not work in this case. – Mikola Jun 14 '11 at 18:26
  • @Mikola -- See Marc B's comment above. Rarely is HTML Valid, it's not XHTML. – zellio Jun 14 '11 at 18:28
  • In HTML, < is not a valid token, unless it appears in a tag. If you fed that into any browser it would barf. EDIT: Actually, I just tried it and apparently firefox actually parses it anyway, even though it isn't a valid document. Go figure... – Mikola Jun 14 '11 at 18:31
  • @Mikola Actually there is a simple counterexample: `>>>ALT>>>`. It really isn't a good idea to use regex for something like HTML. – Howard Jun 14 '11 at 18:32
  • @Howard: Ok, I buy that example. – Mikola Jun 14 '11 at 18:35
  • @Mikola: That's why you can't parse HTML with a regex with any form of reliability: You know > isn't valid unless it's encoded, we know it's not valid unless it's encoded. The rest of the world couldn't care less because their browser renders it "right" anyways. – Marc B Jun 14 '11 at 18:35
  • From my painful experience - if you are using regular expressions to process HTML, don't. HTML is not a regular language and hence cannot be parsed by regular expressions. – Jarek Przygódzki Jun 14 '11 at 18:41
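
The failure modes raised in the comments above can be reproduced with a small sketch. The class name and both input strings are hypothetical, chosen only to illustrate the two cases (a `>` inside a quoted attribute, and an unescaped `<` in text); they are not the exact examples the commenters posted.

```java
public class RegexStripPitfalls {
    // The same lazy pattern the question uses.
    static String strip(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        // A '>' inside a quoted attribute value ends the match too early,
        // so the rest of the tag is left behind as text.
        System.out.println("[" + strip("<img alt=\"a > b\" src=\"x.png\">text") + "]");
        // prints: [ b" src="x.png">text]

        // An unescaped '<' in the text makes the regex swallow real content.
        System.out.println("[" + strip("<p>if x < 7 and y > 3</p>") + "]");
        // prints: [if x  3]
    }
}
```

In both cases the regex has no notion of quoting or of what counts as a tag, which is the gap a real HTML parser fills.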

7 Answers

If you cannot use an HTML parser/cleaner, then I would at least suggest using the Pattern.DOTALL flag so that the pattern also matches tags that span multiple lines. Consider code like this:

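import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The img tag spans two lines, so '.' must also match '\n' (DOTALL).
String str = "123 <img \nsrc='ping.png'>abd foo";
Pattern pt = Pattern.compile("<.*?>", Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sb, ""); // drop each matched tag
}
matcher.appendTail(sb); // copy the text after the last match
System.out.println("Output: " + sb);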

OUTPUT

Output: 123 abd foo
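
To see why the flag matters, here is a minimal side-by-side sketch. The class and method names are hypothetical; only Pattern.DOTALL and the sample string come from the answer above.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // Default mode: '.' does not match line terminators, so a tag that is
    // split across lines never matches and is left in place.
    static String stripDefault(String html) {
        return Pattern.compile("<.*?>").matcher(html).replaceAll("");
    }

    // DOTALL mode: '.' also matches '\n', so the multi-line tag is removed.
    static String stripDotall(String html) {
        return Pattern.compile("<.*?>", Pattern.DOTALL).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "123 <img \nsrc='ping.png'>abd foo";
        System.out.println(stripDefault(html).equals(html)); // prints: true (nothing removed)
        System.out.println(stripDotall(html));               // prints: 123 abd foo
    }
}
```

This probably explains the symptom in the question if the img tags there contain line breaks, but it does not make the regex safe for arbitrary HTML.
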
anubhava
  • What if you try this on the example Howard posted? – Mikola Jun 14 '11 at 18:40
  • @Mikola: Even I have recommended against Regex in my answer. I have answered this because OP was insisting on using simple regex. HTML is so irregular that any regex will fail to take care of those fringe cases. – anubhava Jun 14 '11 at 18:52

To give a more concrete recommendation, use JSoup (or NekoHTML) to parse the HTML into a Java object.

Once you've got a Document object it can easily be traversed to remove the tags. This cookbook recipe shows how to get attributes and text from the DOM.

Jeff Foster

Another suggestion is HtmlCleaner

Thor

I was able to do this with the code snippet below.

String htmlContent = values.get(position).getContentSnippet();
String plainTextContent = htmlContent.replaceAll("<img .*?/>", "");

I used the above regex to clean the img tags in my RSS content.
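
One caveat, as a hedged sketch rather than a fix: the pattern above only matches self-closing img tags, and when an img tag is not self-closing it keeps going until the next "/>" and can swallow content in between. The class name, method names, and the alternative pattern below are my own assumptions, not part of the snippet above.

```java
import java.util.regex.Pattern;

public class ImgTagStripper {
    // Alternative pattern (an assumption, not from the answer): "<img", a word
    // boundary, then anything except '>' up to the closing '>'. The negated
    // class also matches newlines, and the flag handles <IMG ...>.
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static String removeImgTags(String html) {
        return IMG.matcher(html).replaceAll("");
    }

    // The pattern from the answer, kept for comparison.
    static String removeSelfClosingOnly(String html) {
        return html.replaceAll("<img .*?/>", "");
    }

    public static void main(String[] args) {
        String html = "a<img src='x.png'>b<br/>c";
        System.out.println(removeSelfClosingOnly(html)); // prints: ac
        System.out.println(removeImgTags(html));         // prints: ab<br/>c
    }
}
```

This still breaks on a '>' inside an attribute value, so it is only reasonable for input whose shape you control, such as a known RSS feed.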

Ajith M A

Use an HTML parser instead. Iterate over the parsed object and print whatever you like; that gives the best result.
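
As a rough illustration of that approach, here is a sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) as a stand-in, since it needs no extra dependency. This is my assumption of what "use a parser" could look like, not a library this answer names; the class and method names are hypothetical, and the JDK parser is old and fairly lenient, so a dedicated library is likely a better choice for real input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ParserTextExtractor {
    // Collects only the text the parser reports; tags and their attributes
    // (including img) are never copied to the output.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("123 <img \nsrc='ping.png'>abd foo"));
        System.out.println(extractText("<p>see <img alt=\"a > b\" src=\"x.png\">this</p>"));
    }
}
```

Because the parser tracks quoting, a '>' inside an attribute value does not end the tag early. Exact whitespace in the output depends on the parser, so treat the result as text to be normalized.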

Hayati Guvence

I'm just reiterating what others have said already, but this point cannot be overstated: DO NOT USE REGEXES TO PARSE HTML. There are a thousand similar questions about this on SO. Use a proper HTML parser: it will make your life so much easier, and it is far more robust and reliable. Take a look at Dom4j, Jericho, JSoup. Please.

Richard H

So, a piece of code for you. I use http://htmlparser.sourceforge.net/ to parse HTML. It is not overcomplicated and quite straightforward to use.

Basically it looks like this:

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

    ...

    String html; /* read your HTML into variable 'html' */
    String result=null;
    ....
    try {
        Parser p = new Parser(html);
        NodeList nodes = p.parse(null);
        result = nodes.asString();
    } catch (ParserException e) {
        e.printStackTrace();
    }

That will give you plain text stripped of tags (but character entities such as &amp; will not be decoded). And of course you can do plenty more with this library, such as applying filters, using visitors, and iterating over nodes.

Gleb Varenov