
I am using `Html.fromHtml(STRING).toString()` to convert a string that may or may not contain HTML tags and/or HTML entities into a plain-text string.

This is pretty slow: by my last measurement it took about 22 ms per string on average. With a large batch of strings it can add over a minute, so I am looking for a faster option that is built for performance.

Is there any way to speed this up, or are there other decoding options available?

Edit: Since there doesn't appear to be a built-in method that is faster or designed specifically for performance, I will award the bounty to anyone who can point me to a library that:

  • Works well with Android
  • Is licensed for free use
  • Is faster than `Html.fromHtml(String).toString()`

As a note, I already tried Jsoup with `Jsoup.parse(String).text()`, and it was slower.

cottonBallPaws
  • Actually the `Html.fromHtml` was very helpful for me to decode some "ISO-8859", thanks! – Nick Dec 19 '12 at 19:38

6 Answers


What about `unescapeHtml()` from `org.apache.commons.lang.StringEscapeUtils`? The library is available on the Apache Commons site.

(EDIT: June 2019 - See the comments below for updates about the library)
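
To illustrate why a dedicated unescaper can be so much cheaper than a full HTML parse, here is a minimal, self-contained sketch of a single-pass entity decoder. It is not the Commons Lang implementation: `EntityDecoder` and its tiny entity table are hypothetical, it knows only six named entities plus numeric ones, and like `unescapeHtml()` it decodes entities only and does not strip tags.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Android or Commons Lang. */
final class EntityDecoder {
    // Assumption: only a handful of named entities. A real library ships the full table.
    private static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
        NAMED.put("apos", '\'');
        NAMED.put("nbsp", '\u00A0');
    }

    private EntityDecoder() {}

    /** Decodes known entities in one pass; unknown or malformed ones are left untouched. */
    static String decode(String in) {
        int amp = in.indexOf('&');
        if (amp < 0) {
            return in; // fast path: nothing to decode, no allocation
        }
        StringBuilder out = new StringBuilder(in.length());
        int pos = 0;
        while (amp >= 0) {
            out.append(in, pos, amp); // copy the text before '&'
            int semi = in.indexOf(';', amp + 1);
            // Entities are short, so ignore a ';' that is missing or far away.
            if (semi > 0 && semi - amp <= 10
                    && appendDecoded(in.substring(amp + 1, semi), out)) {
                pos = semi + 1;
            } else {
                out.append('&'); // not an entity: keep the ampersand literally
                pos = amp + 1;
            }
            amp = in.indexOf('&', pos);
        }
        out.append(in, pos, in.length());
        return out.toString();
    }

    /** Appends the decoded character and returns true, or returns false if unknown. */
    private static boolean appendDecoded(String name, StringBuilder out) {
        if (name.startsWith("#")) {
            try {
                boolean hex = name.length() > 1
                        && (name.charAt(1) == 'x' || name.charAt(1) == 'X');
                int cp = hex
                        ? Integer.parseInt(name.substring(2), 16)
                        : Integer.parseInt(name.substring(1));
                if (!Character.isValidCodePoint(cp)) {
                    return false;
                }
                out.appendCodePoint(cp);
                return true;
            } catch (NumberFormatException e) {
                return false;
            }
        }
        Character c = NAMED.get(name);
        if (c == null) {
            return false;
        }
        out.append(c.charValue());
        return true;
    }
}
```

With this sketch, `EntityDecoder.decode("Tom &amp; Jerry &lt;3")` returns `Tom & Jerry <3`, and a string without any `&` is returned as-is. For real input, a maintained library is the safer choice because it covers the complete entity list.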

karlcow
  • Works on Android: Check. Library is reliable, free and small: Check. Fast: Check. This is almost 22x the speed of `Html.fromHtml(String).toString()`. Thanks! – cottonBallPaws Feb 03 '11 at 06:37
  • There's a chance that `unescapeHtml4()` from Commons Lang 3.1 is significantly (200x) slower than `unescapeHtml()` from Commons Lang 2.6. I tested using DDMS and Traceview on Galaxy S3 with Android 4.1.1, and emulator with Android 2.2. I know that Dalvik Just-In-Time is disabled when tracing, so my findings may be inaccurate. But it was enough to scare me into using Commons Lang 2.6 instead of 3.1. – TalkLittle Jan 10 '13 at 22:55
  • I can confirm that 3.1 is slower than 2.6, at least it was when I tested it. – roim Jul 10 '13 at 07:57
  • Profiling on a Nexus 5 with Android Marshmallow, `Html.fromHtml()` is significantly faster than commons-lang3-3.1 `unescapeHtml3()` and `unescapeHtml4()`. There is no longer an `unescapeHtml()` method in 3.1 – cottonBallPaws Dec 10 '15 at 16:53
  • Ah, it looks like if I upgrade to commons-lang3 version 3.4, they are once again faster. Now `unescapeHtml3` is almost twice as fast as `Html.fromHtml`. – cottonBallPaws Dec 10 '15 at 17:05
  • The API in the above discussion is deprecated. If you want to use the latest version (3.9 for me), use `org.apache.commons.text.StringEscapeUtils.unescapeHtml4` or `unescapeHtml3` instead, after adding the library to your Android project with `implementation 'org.apache.commons:commons-text:1.6'`. – Jerry Chen May 23 '19 at 02:57

This is an incredibly fast and simple option: Unbescape.

It greatly improved our parsing performance, since every string we handle has to be run through a decoder.

Adam
  • Just tried a simple benchmark a while ago. `Html.fromHtml()` took 27.501 seconds to finish, versus 3.015 seconds for `HtmlEscape.unescapeHtml()` for the exact same test batch. It's a very significant improvement indeed. Thanks for the tip! – DPR Dec 28 '14 at 23:09

`fromHtml()` does not use a high-performance HTML parser, and I have no idea how quick the `toString()` implementation on `SpannedString` is. I doubt either was designed for your scenario.

Ideally, the strings are clean before they get to a low-power phone. Either clean them up in the build process (for resources/assets), or clean them up on a server (before you download them).

If, for whatever reason, you absolutely need to clean them up on the device, you can perhaps use the NDK to create a C/C++ library that does the cleaning for you faster.

CommonsWare
  • Unfortunately, I won't be able to get them cleaned up before they arrive on the device. I don't suppose you know of any libraries that are already available? – cottonBallPaws Jan 31 '11 at 03:00

Have you looked at "Strip HTML from Text JavaScript"?

FrinkTheBrave

"With a large batch of these it can add over a minute"

Any parsing will take some time, and 22 ms seems fast to me. Anyway, can you do it in the background? Would some kind of caching help?
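
The caching idea could look roughly like the sketch below, assuming the batch contains repeated strings (otherwise a cache will not help). `CachingDecoder` and its nested `Decoder` interface are hypothetical names; the wrapped decoder can be whatever conversion is in use, such as the `Html.fromHtml(...).toString()` call from the question. Eviction relies on `LinkedHashMap` in access order, which is a common way to get a small LRU cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache in front of any slow string decoder. */
final class CachingDecoder {

    /** Minimal callback type so the sketch does not depend on java.util.function. */
    interface Decoder {
        String decode(String input);
    }

    private final Decoder decoder;
    private final Map<String, String> cache;

    CachingDecoder(Decoder decoder, final int maxEntries) {
        this.decoder = decoder;
        // accessOrder = true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries; // evict once the cache is over capacity
            }
        };
    }

    /** Returns the cached result if present, otherwise decodes and remembers it. */
    synchronized String decode(String input) {
        String cached = cache.get(input);
        if (cached == null) {
            cached = decoder.decode(input);
            cache.put(input, cached);
        }
        return cached;
    }
}
```

The method is `synchronized` because a `LinkedHashMap` in access order is modified even by reads, so it is not safe to share across background threads without locking.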

gorlok
  • It is being done in the background, but it is something the user has to wait for. Yes, it is an ok speed, but when it has to process a large batch (a couple thousand) it can cause the user to be waiting for over a minute, which is just not great. – cottonBallPaws Feb 02 '11 at 05:07

Although I have not tried them yet, I found some possible solutions:

  1. HTML Java Parsers
  2. HTML Parsing
  3. More HTML Parsing

I hope it helps.

gnclmorais