By chance I am working on a similar subject. It is too early for me to make any authoritative statement. To start with, I took 3 engines: Majestic, Agility and a Regex-based method.
Of course, there are a lot more options (I even once wrote a simple HTML viewer for Palm OS), but this seemed to be a good start.
Majestic does not offer HTML-to-text conversion, just sample code showing how to walk over the HTML string. To start with, I implemented a trivial conversion:
- Write all text nodes
- Convert <p> to "\n\n" and <br> to "\n"
- Ignore everything else
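The three rules above can be sketched in a few lines. This is only an illustration in Python using the standard library's `html.parser`, not the Majestic walker itself; the names `TrivialTextExtractor` and `html_to_text` are made up for the example.

```python
from html.parser import HTMLParser


class TrivialTextExtractor(HTMLParser):
    """Keeps text nodes, maps <p> and <br> to newlines, drops all other tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.parts.append("\n\n")
        elif tag == "br":
            self.parts.append("\n")
        # Every other tag is ignored; only its text content survives.

    def handle_data(self, data):
        # Text nodes are written out unchanged.
        self.parts.append(data)


def html_to_text(html):
    parser = TrivialTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Note that a conversion this naive also writes out the contents of script and style elements, which is one reason it can only be a starting point.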
Then I collected a sample of 50+ HTML files and converted them using all 3 methods. I have to say that I wasn't happy with any of the methods. Two general observations:
- Results from Majestic and Agility were remarkably similar
- The Regex method was an order of magnitude slower
So I looked into the Regex code and found a pointless loop at the bottom. After an easy optimization, the Regex method was only ~25% slower. Given that it performs more than 30 complex Regex replacements, I considered this a good result.
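For anyone who wants to repeat this kind of comparison, a timing harness can be as small as the sketch below. It is a Python illustration under my own assumptions: `time_converter`, `relative_slowdown` and the tag-stripping `strip_tags` stand-in are hypothetical names, not the code of any of the three engines.

```python
import re
import time


def time_converter(convert, documents, repeats=5):
    """Return the best wall-clock time (seconds) for converting all documents."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            convert(doc)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-in for a converter: removes every tag and nothing else.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html):
    return TAG_RE.sub("", html)


def relative_slowdown(candidate_seconds, baseline_seconds):
    """0.25 means the candidate took 25% longer than the baseline."""
    return candidate_seconds / baseline_seconds - 1.0
```

Taking the best of several runs reduces noise from disk caching and other processes, which matters when the difference under discussion is in the range of tens of percent.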
Then I wrote a test html file containing all common html tags and a bit more. As before, Majestic and Agility performed similarly.
- All engines ok: h1, p, tags written as text
- All engines failed: h2+, hr, b
- br: Regex failed, Majestic ok
- Lists: Regex ok, Majestic failed
- Simple 2x2 table: Regex ok, Majestic failed
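A per-tag check like the one above is easy to automate. The sketch below is Python and purely illustrative: the expected outputs in `CASES` are my assumption of what "ok" could mean for each tag, and `naive_convert` is a hypothetical stand-in that merely strips tags.

```python
import re

# Hypothetical expectations: one small snippet per tag and the text
# that would count as an acceptable conversion.
CASES = {
    "p": ("<p>one</p><p>two</p>", "one\n\ntwo"),
    "br": ("one<br>two", "one\ntwo"),
    "hr": ("one<hr>two", "one\n----\ntwo"),
    "li": ("<ul><li>a</li><li>b</li></ul>", "* a\n* b"),
}


def naive_convert(html):
    """Stand-in converter: drops all tags and keeps only the text."""
    return re.sub(r"<[^>]+>", "", html)


def run_cases(convert, cases=CASES):
    """Return a {tag: "ok" | "failed"} report for the given converter."""
    report = {}
    for tag, (html, expected) in cases.items():
        actual = convert(html).strip()
        report[tag] = "ok" if actual == expected else "failed"
    return report
```

Running each engine through the same `run_cases` table makes the comparison repeatable after every tweak.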
There's a lot more to test, for example encoding.
At this moment I would only say that Regex seems to be the better alternative. However, none of the mentioned engines performs satisfactorily. On a positive note, tweaking these engines (particularly Majestic and Regex) is easy. The same may hold true for Agility, but I did not look into the package deeply enough to say.