Regex to remove html does not get rid of img tag

Question

I am using a regex to remove HTML tags. I do something like this: result.replaceAll("\\<.*?\\>", "");

However, it does not help me get rid of the img tags in the html. Any idea what is a good way to do that?

For the love of all things kind and sane do not Regex HTML - Use a parser, please. — zellio, Jun 14 '11 at 18:13
Can you please explain in more detail. I'd think that something like `` should be removed with your regex. — Howard, Jun 14 '11 at 18:13
What happens if the html contains something like `x < 7`? You can't parse/process HTML with a regex. See this answer for why: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454 — Marc B, Jun 14 '11 at 18:14
@Marc B, that isn't a valid example, since you would need to escape the < using &lt; — Mikola, Jun 14 '11 at 18:23
@Mimisbrunnr: I agree that *parsing* HTML with regexes is impossible (by the pumping lemma), but he is just doing a lexical analysis which is a regular language (ie just separate out the tokens). There is no reason why regexes would not work in this case. — Mikola, Jun 14 '11 at 18:26
@Mikola -- See Marc B's comment above. Rarely is HTML Valid, it's not XHTML. — zellio, Jun 14 '11 at 18:28
In HTML, < is not a valid token, unless it appears in a tag. If you fed that into any browser it would barf. EDIT: Actually, I just tried it and apparently firefox actually parses it anyway, even though it isn't a valid document. Go figure... — Mikola, Jun 14 '11 at 18:31
@Mikola Actually there is a simple counterexample: ``. It really isn't a good idea to use regex for something like HTML. — Howard, Jun 14 '11 at 18:32
@Mikola: That's why you can't parse HTML with a regex with any form of reliability: You know > isn't valid unless it's encoded, we know it's not valid unless it's encoded. The rest of the world couldn't care less because their browser renders it "right" anyways. — Marc B, Jun 14 '11 at 18:35
From my painful experience - if you are using regular expressions to process HTML, don't. HTML is not a regular language and hence cannot be parsed by regular expressions. — Jarek Przygódzki, Jun 14 '11 at 18:41

score 2 · Answer 1 · answered Jun 14 '11 at 18:30

If you cannot use HTML parsers/cleaners, then I would at least suggest using the Pattern.DOTALL flag so that tags spanning multiple lines are matched. Consider code like this:

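import java.util.regex.Matcher;
import java.util.regex.Pattern;

String str = "123 <img \nsrc='ping.png'>abd foo";
// DOTALL lets '.' match line terminators, so the two-line img tag is matched
Pattern pt = Pattern.compile("<.*?>", Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
StringBuffer sb = new StringBuffer();
// Replace every match with the empty string
while (matcher.find()) {
    matcher.appendReplacement(sb, "");
}
// Copy whatever follows the last match
matcher.appendTail(sb);
System.out.println("Output: " + sb);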

OUTPUT

Output: 123 abd foo
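
As a rough illustration of why the flag matters, the sketch below compares the plain pattern with its DOTALL form, written with the inline flag (?s). The class and method names are made up for this example, and the last input (an attribute value containing >) is a hypothetical case, not one taken from the thread.

```java
class DotAllDemo {
    // Without DOTALL, '.' stops at line terminators, so a tag split across lines never matches.
    static String stripSingleLine(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // (?s) is the inline form of Pattern.DOTALL: '.' now also matches '\n'.
    static String stripDotAll(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String html = "123 <img \nsrc='ping.png'>abd foo";
        System.out.println(stripSingleLine(html).equals(html)); // true: nothing was removed
        System.out.println(stripDotAll(html));                  // 123 abd foo

        // Hypothetical input: a '>' inside a quoted attribute ends the match too early.
        String tricky = "<img alt=\"a > b\" src=\"x.png\">text";
        System.out.println(stripDotAll(tricky));                //  b" src="x.png">text
    }
}
```

The last line shows the kind of fringe case a simple pattern cannot handle: the match stops at the first >, and the rest of the tag leaks into the output.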

anubhava

What if you try this on the example Howard posted? – Mikola Jun 14 '11 at 18:40
@Mikola: Even I have recommended against Regex in my answer. I have answered this because OP was insisting on using simple regex. HTML is so irregular that any regex will fail to take care of those fringe cases. – anubhava Jun 14 '11 at 18:52

score 1 · Accepted Answer · answered Jun 14 '11 at 18:20

To give a more concrete recommendation, use JSoup (or NekoHTML) to parse the HTML into a Java object.

Once you've got a Document object it can easily be traversed to remove the tags. This cookbook recipe shows how to get attributes and text from the DOM.

Jeff Foster

I had a look at JSoup before, but it's pretty huge. I need a small lightweight library, or code. – Suchi Jun 14 '11 at 18:23
NekoHTML is another option, and Thor has mentioned HTMLCleaner. – Jeff Foster Jun 14 '11 at 18:26
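
If adding a library is the concern, one possible dependency-free route (not mentioned in this thread, so treat it as an assumption) is the HTML parser that ships with the JDK in javax.swing.text.html. It is an old, lenient parser aimed at HTML 3.2, so the following is only a minimal sketch; the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingHtmlText {
    // Collects only the text nodes reported by the parser; tags such as <img> are skipped.
    static String toText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declaration in the input.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(toText("<p>Hello <img \nsrc='ping.png'> world</p>"));
    }
}
```

Whitespace between text chunks may not be preserved exactly, so check the output against your own content before relying on it.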

score 1 · Answer 3 · answered Jun 14 '11 at 18:24

Another suggestion is HtmlCleaner.

Thor

score 0 · Answer 4 · answered Apr 24 '17 at 05:18

I was able to do this with the code snippet below.

String htmlContent = values.get(position).getContentSnippet();
String plainTextContent = htmlContent.replaceAll("<img .*?/>", "");

I used the above regex to clean the img tags in my RSS content.
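
Note that this pattern only matches img tags written in the self-closing form. The sketch below shows what can happen with an unclosed img tag and compares it with a more tolerant pattern; that second pattern and all names here are illustrative assumptions, not part of the answer above.

```java
class ImgTagDemo {
    // Pattern from the answer above: needs a space after "img" and a closing "/>".
    static String stripSelfClosingOnly(String html) {
        return html.replaceAll("<img .*?/>", "");
    }

    // Hypothetical variant: case-insensitive, and [^>]* runs to the first '>'
    // (also across line breaks), whether or not a '/' precedes it.
    static String stripAnyImg(String html) {
        return html.replaceAll("(?i)<img\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.png\">Hello<br/>World";
        // The lazy '.*?' keeps going until the "/>" of <br/>, so "Hello" is lost as well.
        System.out.println(stripSelfClosingOnly(html)); // World
        System.out.println(stripAnyImg(html));          // Hello<br/>World
    }
}
```

The tolerant variant still breaks when an attribute value contains >, so it is only reasonable for content whose markup you know well, such as a single RSS feed.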

score 0 · Answer 5 · answered Jun 14 '11 at 19:12

Use an HTML parser instead. Iterate over the parsed object, print it however you like, and you will get the best result.

edited Oct 02 '12 at 00:52

Hayati Guvence

score 0 · Answer 6 · answered Jun 14 '11 at 19:17

I'm just reiterating what others have said already, but this point cannot be overstated: DO NOT USE REGEXES TO PARSE HTML. There are 1,000 similar questions about this on SO. Use a proper HTML parser; it will make your life so much easier, and it is far more robust and reliable. Take a look at Dom4j, Jericho, JSoup. Please.

score 0 · Answer 7 · answered Jun 14 '11 at 19:24

So, a piece of code for you. I use http://htmlparser.sourceforge.net/ to parse HTML. It is not overcomplicated and quite straightforward to use.

Basically it looks like this:

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

    ...

    String html; /* read your HTML into variable 'html' */
    String result=null;
    ....
    try {
        Parser p = new Parser(html);
        NodeList nodes = p.parse(null);
        result = nodes.asString();
    } catch (ParserException e) {
        e.printStackTrace();
    }

That will give you plain text stripped of tags (but entities such as &amp; will not be decoded). And of course you can do plenty more with this library, such as applying filters and visitors or iterating over nodes.
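
Text extracted this way can still contain character entities. If only a handful of them matter, a small hand-written decoder may be enough. The following is a minimal sketch under that assumption; it covers just four named entities, and the class and method names are made up for illustration.

```java
class EntityDemo {
    // Decodes only four common named entities; everything else is left untouched.
    static String decodeBasicEntities(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
    }

    public static void main(String[] args) {
        System.out.println(decodeBasicEntities("x &lt; 7 &amp;&amp; y &gt; 2")); // x < 7 && y > 2
    }
}
```

Numeric references and the many other named entities are not handled here; a full decoder from an HTML library is the safer choice for arbitrary input.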
