Java remove HTML from String without regular expressions

Question

I am trying to remove all HTML elements from a String. Unfortunately, I cannot use regular expressions because I am developing on the Blackberry platform and regular expressions are not yet supported.

Is there any other way that I can remove HTML from a string? I read somewhere that you can use a DOM Parser, but I couldn't find much on it.

Text with HTML:

<![CDATA[As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (<a href="http://www.netflix.com/RoleDisplay/Billy_Bob_Thornton/20000303">Billy Bob Thornton</a>) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (<a href="http://www.netflix.com/RoleDisplay/Bruce_Willis/99786">Bruce Willis</a>) -- the world's finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task. <a href="http://www.netflix.com/RoleDisplay/Ben_Affleck/20000016">Ben Affleck</a> and <a href="http://www.netflix.com/RoleDisplay/Liv_Tyler/162745">Liv Tyler</a> co-star.]]>

Text without HTML:

As a massive asteroid hurtles toward Earth, NASA head honcho Dan Truman (Billy Bob Thornton) hatches a plan to split the deadly rock in two before it annihilates the entire planet, calling on Harry Stamper (Bruce Willis) -- the world's finest oil driller -- to head up the mission. With time rapidly running out, Stamper assembles a crack team and blasts off into space to attempt the treacherous task. Ben Affleck and Liv Tyler co-star.

Thanks!

Is `Swing` available in the Blackberry API? For more hints check the accepted answer of this question: http://stackoverflow.com/questions/240546/removing-html-from-a-java-string — BalusC, Mar 21 '10 at 22:32
Unfortunately, Swing is not available in the BlackBerry API... — littleK, Mar 21 '10 at 22:42

score 4 · Accepted Answer · answered Mar 21 '10 at 23:24

There are a lot of nuances to parsing HTML in the wild, one of the funnier ones being that many pages out there do not follow any standard. That said, if all your HTML is going to be as simple as your example, something like this is more than enough:

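    // Single-pass tag stripper; assumes s holds simple, well-formed HTML.
    char[] cs = s.toCharArray();
    StringBuilder sb = new StringBuilder();
    boolean tag = false; // true while inside <...>
    for (int i = 0; i < cs.length; i++) {
        char c = cs[i];
        if (tag) {
            // Skip everything inside a tag; '>' closes it.
            if (c == '>') tag = false;
        } else if (c == '<') {
            tag = true;
        } else if (c == '&') {
            // Decodes the escape into sb and returns how many extra chars to skip.
            i += interpretEscape(cs, i, sb);
        } else {
            sb.append(c);
        }
    }
    System.err.println(sb);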

Here interpretEscape() is expected to convert HTML escapes such as &gt; into their character counterparts, append the result to sb, and return the number of characters to skip up to the ending ;.
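
The answer leaves interpretEscape() undefined. A minimal sketch of one way to write it follows, assuming the contract implied by the loop (the caller does `i += interpretEscape(cs, i, sb)` and then `i++`). The class name EscapeDemo, the short entity table, and the 10-character search window are assumptions made for illustration, not part of the original answer. If StringBuilder is unavailable on the target platform, StringBuffer offers the same append methods.

```java
final class EscapeDemo {
    // Deliberately small table of named escapes (illustrative, not complete).
    // Simplification: &nbsp; is mapped to an ordinary space.
    private static final String[] NAMES = { "lt", "gt", "amp", "quot", "apos", "nbsp" };
    private static final char[] CHARS = { '<', '>', '&', '"', '\'', ' ' };

    /**
     * Decodes the escape that starts at cs[i] (which must be '&') and appends
     * the decoded character to sb. Returns the number of characters after the
     * '&' that the caller should skip, so that i lands on the closing ';'.
     * If no known escape is found, appends a literal '&' and returns 0.
     */
    static int interpretEscape(char[] cs, int i, StringBuilder sb) {
        int end = -1;
        // Look for the terminating ';' within a short window.
        for (int j = i + 1; j < cs.length && j <= i + 10; j++) {
            if (cs[j] == ';') { end = j; break; }
        }
        if (end > i + 1) {
            String name = new String(cs, i + 1, end - i - 1);
            int decoded = decode(name);
            if (decoded >= 0) {
                sb.append((char) decoded);
                return end - i;
            }
        }
        sb.append('&'); // not an escape we recognise: keep the ampersand
        return 0;
    }

    // Returns the decoded character, or -1 if the name is not recognised.
    private static int decode(String name) {
        for (int k = 0; k < NAMES.length; k++) {
            if (NAMES[k].equals(name)) return CHARS[k];
        }
        if (name.length() > 1 && name.charAt(0) == '#') {
            try {
                boolean hex = name.charAt(1) == 'x' || name.charAt(1) == 'X';
                int value = hex ? Integer.parseInt(name.substring(2), 16)
                                : Integer.parseInt(name.substring(1), 10);
                if (value > 0 && value <= 0xFFFF) return value;
            } catch (NumberFormatException e) {
                // fall through: treated as an unknown escape
            }
        }
        return -1;
    }
}
```

For the input "a &gt; b" with i pointing at the ampersand, this sketch appends > and returns 3, so the loop's own increment then moves past the semicolon.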

The HTML should always be pretty simple, as shown in my example. This works for me. Thanks very much! — littleK, Mar 21 '10 at 23:40
Looks good. You'll probably need to alter it slightly for the <![CDATA[ ... ]]> wrapper though: the current one will skip the whole content. — Daniel, Mar 21 '10 at 23:45
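
One way to act on the comment about CDATA is to unwrap the section before stripping tags, since the stripper would otherwise treat the leading <![CDATA[ as the start of a tag and lose text. The sketch below assumes the whole string is a single CDATA section, as in the question's sample; the class and method names are hypothetical.

```java
final class CdataDemo {
    private static final String OPEN = "<![CDATA[";
    private static final String CLOSE = "]]>";

    /**
     * Returns the content of s without its CDATA wrapper if s (ignoring
     * surrounding whitespace) is exactly one CDATA section; otherwise
     * returns s unchanged.
     */
    static String unwrapCdata(String s) {
        String t = s.trim();
        boolean wrapped = t.length() >= OPEN.length() + CLOSE.length()
                && t.startsWith(OPEN)
                && t.endsWith(CLOSE);
        if (wrapped) {
            return t.substring(OPEN.length(), t.length() - CLOSE.length());
        }
        return s;
    }
}
```

The unwrapped string can then be passed to the tag-stripping loop as s.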

score 4 · Answer 2 · answered Mar 22 '10 at 09:25

"I cannot use regular expressions because I am developing on the Blackberry platform"

You cannot use regular expressions in any case, because HTML is a recursively nested language and regular expressions cannot handle such languages.

You need a parser.

user207421

score 1 · Answer 3 · answered Mar 21 '10 at 23:10

If you can add external JARs, you can try one of these two small libraries:

TagSoup, a SAX parser
Jericho HTML Parser, another small HTML parser

Both let you strip out all the markup.

I have used Jericho many times. To strip tags, you define an extractor that suits your needs:

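    // Package name as in Jericho 3.x; older releases may use a different package.
    import net.htmlparser.jericho.HTMLElementName;
    import net.htmlparser.jericho.Source;
    import net.htmlparser.jericho.StartTag;
    import net.htmlparser.jericho.TextExtractor;

    class HTMLStripExtractor extends TextExtractor
    {
        public HTMLStripExtractor(Source src)
        {
            super(src);
            src.setLogger(null); // disable logging for this source
        }

        // Returning true drops an element's text from the output,
        // so this keeps the text of <a> elements and drops that of all others.
        public boolean excludeElement(StartTag startTag)
        {
            return !HTMLElementName.A.equals(startTag.getName());
        }
    }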

score 1 · Answer 4 · answered Mar 21 '10 at 23:14

I'd try to tackle this the other way around: create a DOM tree from the HTML and then extract the string from the tree:

Use a library like TagSoup to parse the HTML while cleaning it up into something close to XHTML.
As you stream the cleaned-up XHTML, extract the text you want.
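
The streaming step can be sketched with a plain SAX handler that collects character data. TagSoup's parser implements the standard org.xml.sax.XMLReader interface, so it could be passed to extractText() below. To stay self-contained, this sketch drives the handler with the JDK's own SAX parser, which only accepts well-formed input. The class and method names are hypothetical, and the sketch targets Java SE; whether these APIs exist on the BlackBerry platform is not assumed here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data between tags.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }

    // Parses markup with the given reader and returns only its text content.
    static String extractText(XMLReader reader, String markup) {
        try {
            TextCollector collector = new TextCollector();
            reader.setContentHandler(collector);
            reader.parse(new InputSource(new StringReader(markup)));
            return collector.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Convenience wrapper using the JDK's SAX parser (well-formed input only).
    static String extractWithJdkParser(String xhtml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return extractText(reader, xhtml);
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not create SAX parser", e);
        }
    }
}
```

Because the handler ignores element events entirely, every tag is dropped and entity references are decoded by the parser itself.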
