Is there a better way than using Lynx to convert HTML to plaintext reliably in PHP?

Question

I want to convert an HTML file with a table-based layout to plaintext in order to send a multipart email via PHP.

I have tried a few different pre-built classes / functions that I've found on SO, but none of them produce decent results, which I believe is down to the table-based layout.

I don't want to roll my own class for stripping HTML and formatting the results, as I am sure there are edge cases which I won't account for or be able to test until I come across them in production.

The best solution I've come up with so far is:

Create a temporary HTML file
Use something like shell_exec("/path/to/lynx -dump temporary.html"); to create a plaintext version of the email
Use some regex to get rid of any remaining unwanted tags
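
For reference, the three steps above could be sketched roughly as follows. The function names, the list of placeholders removed in the cleanup step, and the default lynx path are assumptions for illustration; which markers actually survive in the dump depends on the lynx configuration.

```php
<?php
// Sketch only: cleanLynxDump() and htmlToTextViaLynx() are hypothetical
// helpers, and the placeholder list below is an assumption.

function cleanLynxDump(string $text): string
{
    // Drop image/link placeholders that lynx may leave in the dump.
    $text = preg_replace('/\[(?:IMAGE|INLINE|LINK)\]/', '', $text);
    // Remove trailing spaces and tabs on every line.
    $text = preg_replace('/[ \t]+$/m', '', $text);
    // Collapse runs of three or more newlines into a single blank line.
    $text = preg_replace('/\n{3,}/', "\n\n", $text);
    return trim($text) . "\n";
}

function htmlToTextViaLynx(string $html, string $lynx = '/path/to/lynx'): ?string
{
    // Step 1: write the HTML to a temporary file.
    $file = tempnam(sys_get_temp_dir(), 'html');
    if ($file === false) {
        return null;
    }
    if (file_put_contents($file, $html) === false) {
        unlink($file);
        return null;
    }

    // Step 2: let lynx render it. -force_html is used because the
    // temporary file has no .html extension.
    $cmd = escapeshellarg($lynx) . ' -dump -force_html ' . escapeshellarg($file);
    $text = shell_exec($cmd);
    unlink($file);

    // Step 3: tidy up whatever lynx left behind.
    return is_string($text) ? cleanLynxDump($text) : null;
}
```

Calling `htmlToTextViaLynx($html)` returns the plaintext part, or `null` if lynx could not be run. Escaping both arguments with `escapeshellarg()` keeps unusual characters in the paths from being interpreted by the shell.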

This works fine, but I'm a little worried that it's not the optimal way of achieving a decent multipart email. Is anyone aware of a better way?

To clarify, I have already tried the following without success:

html2text class - http://www.chuggnutt.com/html2text.php
Markdownify - http://milianw.de/projects/markdownify/
html2text version 2 - http://www.howtocreate.co.uk/php/html2texthowto.html
http://journals.jevon.org/users/jevon-phd/entry/19818

Instead of creating the "temporary.txt" file, you can use the `-dump` parameter to return the text back to PHP. By using the `-stdin` switch (UNIX only), you can pass the HTML via STDIN into lynx as well. With `-verbose` you should be able to suppress the image tags. I have always found lynx very good at creating text-only representations of HTML sources. – hakre, Dec 27 '11 at 18:56
Have you read the answers to the same question? http://stackoverflow.com/questions/1884550/converting-html-to-plain-text-in-php-for-e-mail – macjohn, Dec 27 '11 at 19:02
@macjohn: Thanks for digging that up, interesting. But I think Riceo tried it already according to the question. But please leave the comment in so that both questions are linked. – hakre, Dec 27 '11 at 19:05
At the risk of asking the obvious, what about strip_tags(), htmlspecialchars() or htmlentities()? – GordonM, Dec 27 '11 at 19:45
@GordonM Thanks for the input. However, when the above helpers / Lynx convert the HTML to plaintext, they attempt to retain the original layout and anchor tags, which is perfect for email clients that can't parse HTML, whereas strip_tags() etc. will just leave the text behind. – Aaron, Dec 27 '11 at 19:54
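
The `-stdin` / `-dump` suggestion from the comments could look something like the sketch below, which avoids the temporary file altogether. `pipeThrough()` is a hypothetical helper name, and passing the command as an array assumes PHP 7.4 or later on a UNIX-like system.

```php
<?php
// Sketch only: pipeThrough() is a hypothetical helper. Passing the command
// as an array requires PHP 7.4+ and bypasses the shell entirely.

function pipeThrough(array $command, string $input): ?string
{
    $spec = [
        0 => ['pipe', 'r'],               // child's STDIN
        1 => ['pipe', 'w'],               // child's STDOUT
        2 => ['file', '/dev/null', 'w'],  // discard STDERR
    ];

    $proc = proc_open($command, $spec, $pipes);
    if (!is_resource($proc)) {
        return null;
    }

    // Write everything first, then read. This is fine for a program that
    // consumes all of its input before printing; a program that streams
    // large output while still reading could block here.
    fwrite($pipes[0], $input);
    fclose($pipes[0]);

    $output = stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // Treat a non-zero exit status as failure.
    return proc_close($proc) === 0 ? $output : null;
}
```

With lynx installed, the call would be along the lines of `pipeThrough(['/path/to/lynx', '-stdin', '-dump'], $html)`, where the path is the same placeholder used in the question.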

score 1 · Answer 1 · answered Dec 28 '11 at 07:04

I don't think Lynx is the best solution :) I've used html2text myself; it works fine and is better than Lynx. If you go the regex route instead, it will be far heavier than calling out to the system shell (shell_exec, system, exec, popen), because you need to preg_replace every unwanted tag, and regex in PHP is very slow. So if you're on a Linux machine, I'd pass the HTML to html2text.

Mr. BeatMasta


Thanks for the response. I would only use regex to clean up a few tags that Lynx adds in, not the whole document. Stripping HTML via regex would constitute "rolling my own" cleansing function, which could leave me open to a lot of edge-case bugs. Also, html2text doesn't play nice with table-based layouts. – Aaron Dec 28 '11 at 11:42
I haven't tried html2text with a table layout, but I think hardly any software can deal with it properly :)) – Mr. BeatMasta Dec 30 '11 at 09:08

score 1 · Answer 2 · edited May 23 '17 at 12:03

PHP's DOMDocument should help you with this. You can traverse the DOM tree and pull out the relevant content as you want.

http://php.net/manual/en/class.domdocument.php
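
As a rough illustration of that idea, the sketch below walks the table rows of a document and emits one text line per row, keeping link targets. The function name, the " | " column separator and the "text <url>" link format are arbitrary choices made for this example, and text outside of tables is ignored entirely, so treat it as a starting point only.

```php
<?php
// Sketch only: tableHtmlToText() is a hypothetical helper, and the output
// format (" | " between cells, "text <url>" for links) is an assumption.

function tableHtmlToText(string $html): string
{
    $doc = new DOMDocument();
    // Email HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix nudges the parser into reading UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Append each link's address to its text so it survives as plaintext.
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $a->appendChild($doc->createTextNode(" <$href>"));
        }
    }

    $lines = [];
    foreach ($doc->getElementsByTagName('tr') as $row) {
        $cells = [];
        foreach ($row->childNodes as $cell) {
            if (!$cell instanceof DOMElement
                || !in_array($cell->nodeName, ['td', 'th'], true)) {
                continue;
            }
            // Cells that wrap a nested table are skipped here; the inner
            // rows are visited later by the outer loop.
            if ($cell->getElementsByTagName('table')->length > 0) {
                continue;
            }
            $text = trim(preg_replace('/\s+/u', ' ', $cell->textContent));
            if ($text !== '') {
                $cells[] = $text;
            }
        }
        if ($cells) {
            $lines[] = implode(' | ', $cells);
        }
    }

    return implode("\n", $lines) . "\n";
}
```

Skipping wrapper cells is what keeps nested layout tables from being printed twice, at the cost of dropping any loose text that sits next to the inner table in the same cell.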
