
I want to convert an HTML file with a table-based layout to plaintext in order to send a multipart email via PHP.

I have tried a few different pre-built classes / functions that I've found on SO, but none of them seem to produce decent results, which I believe is down to the table-based layout.

I don't want to roll my own class for stripping HTML and formatting the results, as I am sure there are edge cases which I won't account for or be able to test until I come across them in production.

The best solution I've come up with so far is:

  1. Create a temporary HTML file
  2. Use something like `shell_exec("/path/to/lynx -dump temporary.html");` to create a plaintext version of the email
  3. Use some regex to get rid of any remaining unwanted tags
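
The three steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `htmlToTextViaLynx()` and `cleanLynxDump()` are hypothetical helper names, `/path/to/lynx` is the placeholder path from the list above, and the cleanup regex assumes the "unwanted tags" are lynx's `[LINK]`, `[IMAGE]` and `[INLINE]` markers. The `-force_html` switch is added on the assumption that the temporary file has no `.html` extension.

```php
<?php
// Hypothetical helper: render HTML to text with lynx via a temporary file.
function htmlToTextViaLynx(string $html, string $lynx = '/path/to/lynx'): string
{
    // 1. Create a temporary file holding the HTML.
    $file = tempnam(sys_get_temp_dir(), 'mail');
    if ($file === false) {
        throw new RuntimeException('Could not create a temporary file');
    }
    file_put_contents($file, $html);

    try {
        // 2. -dump prints the rendered page to stdout; -force_html makes lynx
        //    treat the file as HTML even though it has no .html extension.
        $command = escapeshellarg($lynx) . ' -dump -force_html ' . escapeshellarg($file);
        $text = shell_exec($command);
    } finally {
        unlink($file);
    }

    if (!is_string($text)) {
        throw new RuntimeException('lynx produced no output');
    }

    // 3. Remove whatever markers are left over.
    return cleanLynxDump($text);
}

// Hypothetical cleanup step. Assumption: the leftovers are lynx's
// [LINK], [IMAGE] and [INLINE] placeholders.
function cleanLynxDump(string $text): string
{
    $text = (string) preg_replace('/\[(?:LINK|IMAGE|INLINE)\]/', '', $text);
    // Drop trailing spaces and tabs at the end of every line.
    $text = (string) preg_replace('/[ \t]+$/m', '', $text);

    return trim($text) . "\n";
}
```

Escaping both the binary path and the file name with `escapeshellarg()` keeps the command safe if either ever comes from configuration or user input.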

This works fine, but I'm a little worried that it's not the optimal way of achieving a decent multipart email. Is anyone aware of a better way?

To clarify, I have already tried the following without success:

Aaron
  • Instead of creating the "temporary.txt" file, you can use the `-dump` parameter to return the text back to PHP. By using the `-stdin` switch (UNIX only), you can pass the HTML via STDIN into lynx as well. With `-verbose` you should be able to suppress the image tags. I have always found lynx very good at creating text-only representations of HTML sources. – hakre Dec 27 '11 at 18:56
  • Ah thanks, '-verbose' saves me some time! – Aaron Dec 27 '11 at 19:00
  • Have you read the answers to the same question at http://stackoverflow.com/questions/1884550/converting-html-to-plain-text-in-php-for-e-mail? – macjohn Dec 27 '11 at 19:02
  • @macjohn: Thanks for digging that up, interesting. But I think Riceo tried it already according to the question. But please leave the comment in so that both questions are linked. – hakre Dec 27 '11 at 19:05
  • @macjohn Yep I've tried the proposed solutions there. – Aaron Dec 27 '11 at 19:07
  • At the risk of asking the obvious, what about strip_tags(), htmlspecialchars() or htmlentities()? – GordonM Dec 27 '11 at 19:45
  • @GordonM Thanks for the input, however when the above helpers / Lynx convert the HTML to plaintext they attempt to retain the original layout and anchor tags which is perfect for email clients that can't parse HTML, whereas strip_tags() etc will just leave the text behind. – Aaron Dec 27 '11 at 19:54
  • What problems did you have with e.g. html2text? – Alan H. Apr 10 '12 at 02:46
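
The STDIN approach suggested in the first comment could look roughly like this. It is a sketch under stated assumptions: `pipeThroughCommand()` and `lynxDumpFromString()` are hypothetical helper names, `/path/to/lynx` is the placeholder from the question, and error output is discarded to `/dev/null`, which ties the sketch to UNIX-like systems (as `-stdin` already does).

```php
<?php
// Hypothetical helper: send $input to a shell command on STDIN, return its STDOUT.
// All input is written before any output is read, which is fine for
// email-sized documents; very large inputs would need stream_select().
function pipeThroughCommand(string $command, string $input): string
{
    $descriptors = [
        0 => ['pipe', 'r'],              // the child reads its STDIN from us
        1 => ['pipe', 'w'],              // we read the child's STDOUT
        2 => ['file', '/dev/null', 'w'], // discard STDERR (UNIX only)
    ];

    $process = proc_open($command, $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException("Could not start: $command");
    }

    fwrite($pipes[0], $input);
    fclose($pipes[0]); // signals end of input to the child

    $output = (string) stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    $exitCode = proc_close($process);
    if ($exitCode !== 0) {
        throw new RuntimeException("Command exited with status $exitCode");
    }

    return $output;
}

// Hypothetical wrapper: no temporary file, the HTML goes straight into lynx.
function lynxDumpFromString(string $html, string $lynx = '/path/to/lynx'): string
{
    return pipeThroughCommand(escapeshellarg($lynx) . ' -stdin -dump', $html);
}
```

Per the same comment, `-verbose` could be appended to the command string to deal with the image markers; it is left out here because its effect depends on how lynx is configured.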

2 Answers


I don't think Lynx is the best solution :) I've used html2text myself; it works fine and, in my experience, better than lynx. If you prefer regexes, that would be much heavier than using the system shell (shell_exec, system, exec, popen), as you would need to preg_replace all the unnecessary tags, and regex in PHP is deadly slow. So if this is on a Linux machine, I'd say it's better to pass the HTML to html2text.

Mr. BeatMasta
  • Thanks for the response. I would only use regex to clean a few tags that Lynx adds in, not the whole document. Stripping HTML via Regex would constitute "rolling my own" cleansing function, which could potentially leave me open to a lot of edge bugs. Also, html2text doesn't play nice with table based layouts. – Aaron Dec 28 '11 at 11:42
  • I haven't tried html2text with a table layout, but I think there is hardly any software that can deal with it properly :)) – Mr. BeatMasta Dec 30 '11 at 09:08

PHP's DOMDocument should help you with this. You can traverse the DOM tree and extract the relevant content as you need.

http://php.net/manual/en/class.domdocument.php
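
As an illustration of what such a traversal might look like, here is a minimal sketch. The function names `htmlToPlainText()` and `renderNode()` are hypothetical, and the formatting rules (one line per table row, cells separated by a space, links rendered as `text (href)`, UTF-8 input) are assumptions chosen for the example rather than anything DOMDocument prescribes.

```php
<?php
// Hypothetical entry point: parse the HTML and flatten it to plain text.
function htmlToPlainText(string $html): string
{
    $doc = new DOMDocument();

    // Real-world email markup is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a common trick to make libxml read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc->documentElement === null) {
        return '';
    }

    $text = renderNode($doc->documentElement);
    $text = (string) preg_replace('/[ \t]+/', ' ', $text);    // collapse runs of spaces
    $text = (string) preg_replace('/ ?\n ?/', "\n", $text);   // trim around line breaks
    $text = (string) preg_replace('/\n{3,}/', "\n\n", $text); // at most one blank line

    return trim($text);
}

// Hypothetical recursive walker over the DOM tree.
function renderNode(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Whitespace in the HTML source is not significant.
        return (string) preg_replace('/\s+/', ' ', $node->data);
    }
    if (!($node instanceof DOMElement)) {
        return ''; // comments, processing instructions, etc.
    }

    $name = strtolower($node->nodeName);
    if (in_array($name, ['head', 'script', 'style'], true)) {
        return '';
    }
    if ($name === 'br') {
        return "\n";
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= renderNode($child);
    }

    if ($name === 'a' && $node->getAttribute('href') !== '') {
        return $inner . ' (' . $node->getAttribute('href') . ')';
    }
    if ($name === 'td' || $name === 'th') {
        return $inner . ' ';  // cells on one line, separated by a space
    }
    if ($name === 'tr' || $name === 'li') {
        return $inner . "\n"; // one line per row or list item
    }
    if (in_array($name, ['p', 'div', 'table', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'], true)) {
        return "\n" . $inner . "\n";
    }

    return $inner;
}
```

A call such as `htmlToPlainText(file_get_contents('email.html'))` (with a hypothetical file name) would return the text part; nested layout tables are flattened row by row, so column alignment is not preserved.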

Related question on SO:

Parse HTML with PHP's HTML DOMDocument

DhruvPathak