88

I am trying to parse HTML from a webpage in Android, and since the webpage is not well formed, I get a SAXException.

Is there a way to parse HTML in Android?
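
For context, here is a minimal sketch of why a strict XML parser rejects typical real-world HTML. It uses the SAX parser from the standard Java library (JAXP); the class name StrictParserDemo, the helper parsesAsXml, and the sample markup are made up for illustration, and the exact error reporting on Android may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: checks whether markup survives a strict XML parse.
class StrictParserDemo {

    /** Returns true if the SAX parser accepts the markup, false if it throws a SAXException. */
    static boolean parsesAsXml(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // default handler rethrows fatal parse errors
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed XML, e.g. an unclosed tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Every tag is closed, so this is also valid XML.
        System.out.println(parsesAsXml("<html><body><p>ok</p></body></html>"));
        // <br> is never closed: fine for browsers, fatal for an XML parser.
        System.out.println(parsesAsXml("<html><body><p>unclosed<br></body></html>"));
    }
}
```

The first call prints true and the second prints false, which is the situation described above: the page is acceptable HTML for a browser but not well-formed XML.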

Yi Jiang
Daniel Benedykt
  • I suspect the Rhino dependency will make htmlunit hell to compile on Android, but you could try... Also, some other non-strict HTML parser such as soup might work. – alex Feb 02 '10 at 22:04
  • I wonder if webkit can be used here. – mtmk Feb 02 '10 at 22:20

5 Answers

76

I just ran into this problem. I tried a few things but settled on jsoup. The jar is about 132 KB, which is a bit big, but if you download the source and remove the methods you will not be using, it gets smaller.
The good thing about it is that it handles badly formed HTML.

Here's a good example from their site.

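import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Parse a local file; the third argument is the base URI used to resolve relative links.
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

// Alternatively, fetch and parse straight from a URL:
// http://jsoup.org/cookbook/input/load-document-from-url
// Document doc = Jsoup.connect("http://example.com/").get();

// getElementById returns null if the page has no element with id "content".
Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href"); // value of the href attribute
  String linkText = link.text();       // visible text of the link
}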
Gabe
ibaralf
57

Have you tried using Html.fromHtml(source)?

I think that class is pretty liberal with respect to source quality (it uses TagSoup internally, which was designed with real-life, bad HTML in mind). It doesn't support all HTML tags, but it does come with a handler interface (Html.TagHandler) that you can implement to react to tags it doesn't understand.

mxk
  • 1
    This is very simple, I cannot search for exact things (like XPATH) –  Nov 09 '15 at 12:08
  • Attention, please: this can cause "Suspending all threads". I ran into it when I received JSON containing HTML-formatted text. The HTML text displayed correctly, but after using Html.fromHtml() I hit this problem. – David Feb 23 '16 at 13:37
25
String tmpHtml = "<html>a whole bunch of html stuff</html>";
String htmlTextStr = Html.fromHtml(tmpHtml).toString();
EddieB
  • nice and simple, no plugins, love it! tnxs – RonEskinder Dec 16 '15 at 18:28
  • 2
    As a note: calling `toString()` on the `Spanned` object returned from `Html.fromHtml(str)` will make many of the `HTML` tags not work (including `` `` ``). So if you're setting a textview just do: `myTextView.setText(Html.fromHtml(str))` – Sakiboy May 11 '16 at 19:34
  • @Sakiboy You are right. In addition, there are many other tags that do not work with `Html.fromHtml()`. Check this out: http://stackoverflow.com/a/3150456/1987045 – rahulrvp Sep 28 '16 at 13:13
  • Awesome, exactly what I wanted. My server-side dev was sending me HTML, and now I can easily convert it to a raw string. Thanks – Zulqurnain Jutt Apr 06 '18 at 08:05
3

Programming offers endless possibilities, and there are usually several solutions to a single problem. All of the solutions above are good and may help someone, but this is the one that saved my day.

The code goes like this:

  private void getWebsite() {
    new Thread(new Runnable() {
      @Override
      public void run() {
        final StringBuilder builder = new StringBuilder();

        try {
          Document doc = Jsoup.connect("http://www.ssaurel.com/blog").get();
          String title = doc.title();
          Elements links = doc.select("a[href]");

          builder.append(title).append("\n");

          for (Element link : links) {
            builder.append("\n").append("Link : ").append(link.attr("href"))
            .append("\n").append("Text : ").append(link.text());
          }
        } catch (IOException e) {
          builder.append("Error : ").append(e.getMessage()).append("\n");
        }

        runOnUiThread(new Runnable() {
          @Override
          public void run() {
            result.setText(builder.toString());
          }
        });
      }
    }).start();
  }

You just have to call the method above from the onCreate method of your MainActivity.

I hope this one is also helpful for you guys.

Also read the original blog at Medium

Nitin
1

Maybe you can use WebView, but as you can see in the documentation, WebView doesn't enable JavaScript by default, and it lacks other browser features such as widgets.

http://developer.android.com/reference/android/webkit/WebView.html

I think you can enable JavaScript if you need it (WebSettings.setJavaScriptEnabled(true)).

oropher