Converting HTML to plain text in PHP for e-mail

Question

I use TinyMCE to allow minimal formatting of text within my site. From the HTML that's produced, I'd like to convert it to plain text for e-mail. I've been using a class called html2text, but it's really lacking in UTF-8 support, among other things. I do, however, like that it maps certain HTML tags to plain text formatting — like putting underscores around text that previously had tags in the HTML.

Does anyone use a similar approach to converting HTML to plain text in PHP? And if so: Do you recommend any third-party classes that I can use? Or how do you best tackle this issue?

See also ["HTML to plain text (for email)"](http://stackoverflow.com/questions/1930297/html-to-plain-text-for-email) — outis, Apr 26 '11 at 23:01
html2text has [scary code execution vulnerabilities](http://www.madirish.net/node/225). — Tgr, Nov 28 '11 at 11:57
For reference, wikipedia [links to a survey](http://en.wikipedia.org/wiki/HTML_email#cite_note-clickz_data-5) that said only about 3% of people use text-only email. — Redzarf, Aug 13 '13 at 18:33
@Redzarf it's not about these 3%. Adding a plain text part is a really good idea if you don't want your email to go directly to the spam folder. Plus, these 3% are probably not taking into account light mobile clients. Last but not least: 3% is greater than 0%, which should make you consider it seriously. — Ninj, Oct 02 '13 at 09:53
@Ninj I just checked and the survey was from 2002, so things will have changed since then (though I still think 3% is probably about right.) Good point about the spam issue - for anyone reading this later who is concerned about spam, I found that this tool was excellent: http://www.port25.com/support/authentication-center/email-verification/ — Redzarf, Oct 02 '13 at 13:37
It's also handy for converting HTML emails to plain text for other contexts (like storing a message in a db or printing it out as clean text, etc.), so just because I don't read my email as plain text doesn't mean I might not need a plain text copy for other uses — Anthony, Mar 25 '15 at 09:59
adding a text part in addition to html also gives you another point with SpamAssassin: https://wiki.apache.org/spamassassin/Rules/MIME_HTML_ONLY — Wes, Oct 12 '17 at 13:48
here is a simple solution htmlspecialchars(trim(strip_tags($htmlString))); $htmlString will be replaced by your html text — Abhijeet kumar sharma, Aug 22 '18 at 11:51
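Several comments above recommend sending a plain-text part alongside the HTML. A minimal sketch of what that could look like, assuming a hand-built multipart/alternative body; `buildAlternativeBody` and the boundary format are made up for illustration, and the `strip_tags` conversion is only a placeholder for whichever converter you end up using:

```php
<?php
// Hypothetical helper: builds a multipart/alternative body with the
// plain-text part first and the HTML part second (clients generally
// display the last part they are able to render).
function buildAlternativeBody(string $text, string $html, string $boundary): string
{
    $parts = [
        ['text/plain; charset=UTF-8', $text],
        ['text/html; charset=UTF-8', $html],
    ];

    $body = '';
    foreach ($parts as [$type, $content]) {
        $body .= "--{$boundary}\r\n"
            . "Content-Type: {$type}\r\n"
            . "Content-Transfer-Encoding: base64\r\n\r\n"
            . chunk_split(base64_encode($content)); // 76-character lines ending in CRLF
    }

    return $body . "--{$boundary}--\r\n";
}

$html = '<p>Hello <b>world</b></p>';
$text = trim(strip_tags($html)); // naive conversion, for illustration only
$boundary = 'b_' . bin2hex(random_bytes(8));

$headers = "MIME-Version: 1.0\r\n"
    . "Content-Type: multipart/alternative; boundary=\"{$boundary}\"";
$body = buildAlternativeBody($text, $html, $boundary);
// mail($to, $subject, $body, $headers); // sending is left out of this sketch
```

In practice a mail library will usually build these parts for you; the sketch only shows where the converted text ends up.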

score 113 · Accepted Answer · edited Jun 12 '18 at 00:23

Use html2text (example HTML to text), licensed under the Eclipse Public License. It uses PHP's DOM methods to load from HTML, and then iterates over the resulting DOM to extract plain text. Usage:

// when installed using the Composer package
$text = Html2Text\Html2Text::convert($html);

// usage when installed using html2text.php
require('html2text.php');
$text = convert_html_to_text($html);

Although incomplete, it is open source and contributions are welcome.
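To give a rough idea of the "load into a DOM, then iterate" approach described above, here is a much-reduced sketch. It is not the library's code: `domNodeToText` and `sketchHtmlToText` are hypothetical names, the list of block elements is arbitrary, and the prepended `<meta>` tag is just one common way to hint UTF-8 to `loadHTML`.

```php
<?php
// Hypothetical helper, not the library's implementation: walks a DOM tree and
// collects text, adding line breaks around a few block-level elements.
function domNodeToText(DOMNode $node): string
{
    if ($node instanceof DOMText) {
        // Collapse runs of whitespace, roughly as a browser would.
        return preg_replace('/\s+/u', ' ', $node->nodeValue);
    }
    if (!($node instanceof DOMElement || $node instanceof DOMDocument)) {
        return ''; // comments, doctype, processing instructions
    }

    $name = strtolower($node->nodeName);
    if (in_array($name, ['script', 'style', 'head'], true)) {
        return ''; // never part of the visible text
    }
    if ($name === 'br') {
        return "\n";
    }

    $text = '';
    foreach ($node->childNodes as $child) {
        $text .= domNodeToText($child);
    }

    if (in_array($name, ['p', 'div', 'h1', 'h2', 'h3', 'li'], true)) {
        $text = "\n" . trim($text) . "\n";
    }
    return $text;
}

function sketchHtmlToText(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $text = domNodeToText($doc);
    // Tidy up: trim spaces around line breaks and collapse runs of blank lines.
    $text = preg_replace('/[ \t]*\n[ \t]*/', "\n", $text);
    return trim(preg_replace('/\n{3,}/', "\n\n", $text));
}

echo sketchHtmlToText('<h1>Title</h1><p>Hello <b>world</b></p><p>Bye</p>');
```

The real library handles many more cases (links, tables, `pre` blocks, Office markup), so treat this only as an outline of the traversal.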

Issues with other conversion scripts:

html2text (GPL) is not EPL-compatible.
lkessler's link (attribution) is incompatible with most open source licenses.

edited Jun 12 '18 at 00:23

Abhi Beckert

answered Apr 02 '10 at 00:32

jevon

The first script above is released under the GPL, which is *not* a "non-commercial" license. Depending on context it may be undesirable, but it is not "non-commercial". The second link also allows commercial use - just with attribution. That's not "non-commercial" either. – Oliver Moran May 19 '13 at 20:48
@OliverMoran You're right, I've edited the answer to more accurately reflect their license limitations. – jevon May 20 '13 at 21:57
Thank you @jevon, i included your work in my project and it works great! Unfortunately, it didn't help to solve my Outlook problem (http://stackoverflow.com/questions/19135443/why-wont-outlook-use-the-text-plain-part) but i get clean result that way. – Ninj Oct 02 '13 at 11:57
Link broken. Down-voting. – Sibidharan Oct 14 '16 at 08:19
Please clarify: who would detect whether someone is using it under the GPL or not? – Miguel Mar 30 '17 at 14:49
This has some issues in PHP 7 – Brian Leishman Apr 11 '17 at 15:07
I have not seen a `convert_html_to_text()` function, although I was able to make the Html2Text (very first link) work without much of a problem. – Alexis Wilke Jul 30 '17 at 07:19
To remove duplicate line breaks: `preg_replace('/\n{2,}/', "\n", Html2Text::convert($html, ['ignore_errors' => true]))` – Maxim Mandrik May 29 '22 at 18:42
That class is not really ready yet. It ignores visibility and display attributes, so you will see hidden stuff that can break the entire output. Its formatting of tables is not column based, so a table of 4 columns will be broken into vertical blocks. – John Feb 19 '23 at 15:41

T.Todua · Answer 2 · 2020-12-25T09:33:21.860

39

here is another solution:

$cleaner_input = strip_tags($text);

For other variations of sanitization functions, see:

https://github.com/ttodua/useful-php-scripts/blob/master/filter-php-variable-sanitize.php

edited Dec 25 '20 at 09:33

answered Jun 25 '13 at 16:58

T.Todua

Better version `$ClearText = preg_replace( "/\n\s+/", "\n", rtrim(html_entity_decode(strip_tags($HTMLText))) );` – mAsT3RpEE Jan 27 '14 at 14:11
this is so simple and no need another library. also working very well.......... :) – mili Nov 27 '18 at 00:30
To remove duplicate line breaks: `preg_replace('/\n{2,}/', "\n", strip_tags($htmlText))` – Maxim Mandrik May 29 '22 at 18:41
This also returns the javascript codes. – Crouching Kitten Apr 08 '23 at 23:10
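Pulling the comments above together: a possible sketch that removes script and style blocks before stripping tags, then decodes entities and tidies blank lines. `stripToText` is a hypothetical name and the regular expression is a simplification that assumes reasonably well-formed markup.

```php
<?php
// Hypothetical helper combining the suggestions from the comments above.
function stripToText(string $html): string
{
    // Remove <script> and <style> blocks together with their contents;
    // strip_tags() alone would leave the code behind as text.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);

    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of blank lines left behind by the removed markup.
    return trim(preg_replace('/\n\s*\n+/', "\n\n", $text));
}

echo stripToText("<p>Fish &amp; chips</p>\n<script>alert('x');</script>\n<p>Done</p>");
```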

pestilence669 · Answer 3 · 2009-12-11T01:46:23.590

16

There's the trusty strip_tags function. It's not pretty though. It'll only sanitize. You could combine it with a string replace to get your fancy underscores.


<?php
// Strip all tags, wrapping italic text in underscores first.
$plain = strip_tags(str_replace(array("<i>", "</i>"), array("_", "_"), $text));

// Preserve anchors by allowing <a> through strip_tags' second argument.
$withLinks = strip_tags($text, '<a>');

?>
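The same idea can be extended to tags that carry attributes or appear in upper case. A minimal sketch, assuming a made-up mapping of inline tags to markers (`markInlineFormatting` and the marker choices are illustrative, not a standard):

```php
<?php
// Hypothetical mapping of inline tags to plain-text markers; adjust to taste.
function markInlineFormatting(string $html): string
{
    $markers = [
        'b' => '*', 'strong' => '*',
        'i' => '_', 'em' => '_',
        'u' => '_',
    ];

    foreach ($markers as $tag => $marker) {
        // Match opening tags (with optional attributes) and closing tags;
        // \b keeps <b> from also matching <br> or <body>.
        $html = preg_replace('#</?' . $tag . '\b[^>]*>#i', $marker, $html);
    }

    return strip_tags($html);
}

echo markInlineFormatting('<p>This is <em class="x">very</em> <b>important</b>.</p>');
// This is _very_ *important*.
```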

edited Dec 11 '09 at 01:46

answered Dec 10 '09 at 23:07

pestilence669

Don't forget that strip tags also removes anchors! – Alix Axel Dec 10 '09 at 23:58

score 13 · Answer 4 · edited Feb 22 '14 at 22:53

Converting from HTML to text using a DOMDocument is a viable solution. Consider HTML2Text, which requires PHP5.

Regarding UTF-8, the write-up on the "howto" page states:

PHP's own support for unicode is quite poor, and it does not always handle utf-8 correctly. Although the html2text script uses unicode-safe methods (without needing the mbstring module), it cannot always cope with PHP's own handling of encodings. PHP does not really understand unicode or encodings like utf-8, and uses the base encoding of the system, which tends to be one of the ISO-8859 family. As a result, what may look to you like a valid character in your text editor, in either utf-8 or single-byte, may well be misinterpreted by PHP. So even though you think you are feeding a valid character into html2text, you may well not be.

The author provides several approaches to solving this and states that version 2 of HTML2Text (using DOMDocument) has UTF-8 support.

Note the restrictions for commercial use.

Markdownify is no longer maintained; the online demo throws many warnings and doesn't work. The new version of html2text does work for my email. A late +1 to lkessler. — malcanso, Sep 23 '13 at 22:38

score 9 · Answer 5 · answered Mar 08 '12 at 02:32

You can use lynx with -stdin and -dump options to achieve that:

<?php
$descriptorspec = array(
   0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
   1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
   2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);

$process = proc_open('lynx -stdin -dump 2>&1', $descriptorspec, $pipes, '/tmp', NULL);

if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // Any error output will be appended to htmp2txt.log

    $stdin = $pipes[0];
    fwrite($stdin,  <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
 <title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>

<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
    );
    fclose($stdin);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);

    echo "command returned $return_value\n";
}

score 8 · Answer 6 · answered Dec 13 '13 at 03:40

You can test this function

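function html2text($Document) {
    $Rules = array ('@<script[^>]*?>.*?</script>@si', // strip JavaScript
                    '@<[\/\!]*?[^<>]*?>@si',          // strip HTML tags
                    '@([\r\n])[\s]+@',                // collapse whitespace after a line break
                    '@&(quot|#34);@i',                // replace HTML entities
                    '@&(amp|#38);@i',
                    '@&(lt|#60);@i',
                    '@&(gt|#62);@i',
                    '@&(nbsp|#160);@i',
                    '@&(iexcl|#161);@i',
                    '@&(cent|#162);@i',
                    '@&(pound|#163);@i',
                    '@&(copy|#169);@i',
                    '@&(reg|#174);@i'
             );
    $Replace = array ('',
                      '',
                      '\1',
                      '"',
                      '&',
                      '<',
                      '>',
                      ' ',
                      chr(161),
                      chr(162),
                      chr(163),
                      chr(169),
                      chr(174)
                );
    $Document = preg_replace($Rules, $Replace, $Document);

    // The /e modifier was removed in PHP 7, so the remaining numeric entities
    // are decoded with a callback. Like the chr() calls above, this yields
    // single-byte (Latin-1) characters.
    return preg_replace_callback('@&#(\d+);@', function ($m) {
        return chr((int) $m[1]);
    }, $Document);
}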

Thanks for this. Worked great for my use (converting HTML for an RSS feed), and provided a simple template for adding two additional cases (’ and —). — Alan M., Jan 08 '14 at 22:35
On local working but got error online "preg_replace(): The /e modifier is no longer supported, use preg_replace_callback" — Sandeep Sherpur, Feb 23 '23 at 11:58
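For the `/e` error mentioned in the last comment, the usual replacement is `preg_replace_callback`. A minimal sketch for the numeric-entity rule, assuming UTF-8 output is wanted; `decodeNumericEntities` is a hypothetical name:

```php
<?php
// Hypothetical helper: decodes decimal numeric entities such as &#8217;
// without relying on the removed /e modifier.
function decodeNumericEntities(string $text): string
{
    return preg_replace_callback(
        '@&#(\d+);@',
        function (array $m): string {
            // html_entity_decode() handles code points above 255,
            // which chr() cannot represent.
            return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $text
    );
}

echo decodeNumericEntities('It&#8217;s 5 &#8212; 3');
```

If you do not need to treat numeric entities separately, calling `html_entity_decode` once on the whole string covers named entities as well.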

Rob · Answer 7 · 2016-11-21T20:19:43.593

6

I didn't find any of the existing solutions fitting - simple HTML emails to simple plain text files.

I've opened up this repository, hope it helps someone. MIT license, by the way :)

https://github.com/RobQuistNL/SimpleHtmlToText

Example:

$myHtml = '<b>This is HTML</b><h1>Header</h1><br/><br/>Newlines';
echo (new Parser())->parseString($myHtml);

returns:

**This is HTML**
### Header ###


Newlines

edited Nov 21 '16 at 20:19

answered Nov 21 '16 at 15:34

Rob

Flagged as low-quality for length and content. I dunno. Maybe the post should say something about how your code can be used to answer the problem, or maybe it should be a comment. The most popular answers seem to show how solutions can be invoked from within PHP code. – Bill Bell Nov 21 '16 at 16:54
I'm sorry for writing that library. I've added a little example for you if you don't want to click the link and look at the example.. – Rob Nov 21 '16 at 20:20
Don't be sorry! :-) I was writing as an SO reviewer. It isn't that I didn't want to click the link. It's that SO answers that require that one do that are considered substandard. I dunno why anyone would down-vote your answer incidentally. – Bill Bell Nov 21 '16 at 23:55

Aommy Indy · Answer 8 · 2018-05-28T10:14:39.880

6

function plainText($text)
{
    // Keep only the tags that should become line breaks...
    $text = strip_tags($text, '<br><p><li>');
    // ...then replace each remaining tag with a newline.
    $text = preg_replace('/<[^>]*>/', PHP_EOL, $text);

    return $text;
}

$text = "string 1 string 2 <ul><li>string 3</li><li>string 4</li></ul>string 5";

echo plainText($text);

output
string 1
string 2
string 3
string 4
string 5

edited May 28 '18 at 10:14

answered Aug 11 '17 at 08:11

Aommy Indy

Don't just post code. Please add text explaining why this is the answer. – Himanth Aug 11 '17 at 08:35

score 4 · Answer 9 · answered May 15 '18 at 14:36

If you want to convert the HTML special characters and not just remove them as well as strip things down and prepare for plain text this was the solution that worked for me...

function htmlToPlainText($str){
    $str = str_replace('&nbsp;', ' ', $str);
    $str = html_entity_decode($str, ENT_QUOTES | ENT_COMPAT , 'UTF-8');
    $str = html_entity_decode($str, ENT_HTML5, 'UTF-8');
    $str = html_entity_decode($str);
    $str = htmlspecialchars_decode($str);
    $str = strip_tags($str);

    return $str;
}

$string = '<p>this is (&nbsp;) a test</p>
<div>Yes this is! &amp; does it get "processed"? </div>';

echo htmlToPlainText($string);
// this is ( ) a test
// Yes this is! & does it get "processed"?

html_entity_decode with ENT_QUOTES | ENT_XML1 converts things like &#39;, htmlspecialchars_decode converts things like &amp;, html_entity_decode converts things like &lt;, and strip_tags removes any HTML tags left over.
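One thing to watch with the order used above: decoding before stripping turns escaped text such as `&lt;b&gt;` into a real tag, which `strip_tags` then removes. A sketch of a variant that strips first and decodes once (`htmlToPlainTextOnce` is a hypothetical name; the single `ENT_QUOTES | ENT_HTML5` call is my assumption about what the repeated calls were meant to cover):

```php
<?php
// Hypothetical variant: strip tags first, then decode entities exactly once.
function htmlToPlainTextOnce(string $html): string
{
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // A decoded &nbsp; is U+00A0; swap it for a regular space.
    return str_replace("\u{00A0}", ' ', $text);
}

echo htmlToPlainTextOnce('<p>Use &lt;b&gt; for bold&nbsp;text &amp; &#39;quotes&#39;</p>');
// Use <b> for bold text & 'quotes'
```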

score 3 · Answer 10 · answered Dec 28 '11 at 10:14

Markdownify converts HTML to Markdown, a plain-text formatting system used on this very site.

answered Dec 28 '11 at 10:14

outis

A good choice, except for how it handles links. But try the online demo if you're considering it. – Redzarf Aug 13 '13 at 18:31

score 2 · Answer 11 · answered Nov 24 '16 at 16:10

I ran into the same problem as the OP, and the solutions from the top answers above didn't work for my scenarios. See why at the end.

Instead, I found this helpful script (to avoid confusion, let's call it html2text_roundcube), available under the GPL:

https://github.com/mtibben/html2text

It's actually an updated version of an already mentioned script - http://www.chuggnutt.com/html2text.php - updated by RoundCube mail.

Usage:

$h2t = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');
echo $h2t->getText(); // prints Hello, "WORLD"

Why html2text_roundcube proved better than the others:

Script http://www.chuggnutt.com/html2text.php didn't work out of the box for cases with special HTML codes/names (eg ä), or unpaired quotes (eg 25" Monitor).
Script https://github.com/soundasleep/html2text had no option to hide or group the links at the end of the text, making a usual HTML page look bloated with links when in text-plain format; customizing the code for special treatment of how the transformation is done is not as straight forward as simply editing an array in html2text_roundcube.

score 2 · Answer 12 · answered Sep 03 '19 at 18:39

For UTF-8 text, mb_convert_encoding worked for me. To keep processing regardless of parse warnings, make sure you suppress them with the "@" operator.

The basic code I use is:

$dom = new DOMDocument();
@$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));

$body = $dom->getElementsByTagName('body')->item(0);
echo $body->textContent;
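Note that the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2. A possible alternative, sketched under the assumption that the input is already UTF-8: prepend a charset `<meta>` tag so the parser reads the bytes as UTF-8, and collect libxml warnings instead of silencing them with "@". `bodyTextUtf8` is a hypothetical name.

```php
<?php
// Hypothetical helper; the prepended <meta> charset hint is a common
// workaround and not something the code above uses.
function bodyTextUtf8(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

echo bodyTextUtf8('<p>Grüße, <b>Zoë</b></p>');
```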

If you want something more advanced, you can iteratively analyze the nodes, but you will encounter many problems with whitespaces.

I have implemented a converter based on what I say here. If you are interested, you can download it from git https://github.com/kranemora/html2text

It may serve as a reference for writing your own.

You can use it like this:

$html = <<<EOF
<p>Welcome to <strong>html2text</strong></p>
<p>It's <em>works</em> for you?</p>
EOF;

$html2Text = new \kranemora\Html2Text\Html2Text;
$text = $html2Text->convert($html);

score 1 · Answer 13 · answered May 16 '12 at 21:17

I have just found a PHP function "strip_tags()" and it's working in my case.

I tried to convert the following HTML :

<p><span style="font-family: 'Verdana','sans-serif'; color: black; font-size: 7.5pt;">&nbsp;</span>Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry's lackluster performance during this time,  revenue has grown at an average annual rate&nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&nbsp; So despite the downturn, how were we  able to manage growth as an industry?</p>

After applying strip_tags() function, I have got the following output :

&amp;nbsp;Many  practitioners are optimistic that the eyeglass and contact lens  industry will recover from the recent economic storm. Did your practice  feel its affects?&amp;nbsp; Statistics show revenue notably declined in 2008 and  2009. But interestingly enough, those that monitor these trends state  that despite the industry&#039;s lackluster performance during this time,  revenue has grown at an average annual rate&amp;nbsp;of 2.2% over the last five  years, to $9.0 billion in 2010.&amp;nbsp; So despite the downturn, how were we  able to manage growth as an industry?

strip_tags() won't handle a case where you have multiple elements on several lines which are considered by html as 'inline' and will display them on multiple lines. Also, the reverse case - if you have multiple div elements on one line, it will strip the tags and concatenate the content. I've shared my experience here: http://stackoverflow.com/questions/1930297/html-to-plain-text-for-email/12563906#12563906 — Nikola Petkanski, Sep 24 '12 at 12:55
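One way to work around the block/inline problem described in the comment above is to turn block boundaries into line breaks before stripping. A minimal sketch; `stripTagsKeepingBlocks` is a hypothetical name and the list of block tags is deliberately short:

```php
<?php
// Hypothetical helper: turns block-level boundaries into line breaks before
// stripping, so adjacent blocks do not run together.
function stripTagsKeepingBlocks(string $html): string
{
    // Source line breaks carry no meaning in HTML; treat them as spaces.
    $html = preg_replace('/\s+/', ' ', $html);

    // <br> and the closing tags of common block elements end a line.
    $html = preg_replace('#<br\s*/?>|</(p|div|li|tr|h[1-6])>#i', "\n", $html);

    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Trim each line and drop the empty ones.
    $lines = array_filter(array_map('trim', explode("\n", $text)), 'strlen');
    return implode("\n", $lines);
}

echo stripTagsKeepingBlocks("<div>One</div><div>Two</div><p>Three\n<b>still three</b></p>");
```

This keeps inline elements on one line even when the source spreads them over several, and separates consecutive div elements that sit on a single source line.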

score 1 · Answer 14 · answered Apr 02 '18 at 17:02

If you don't want to strip the tags completely and keep the content inside the tags, you can use the DOMDocument and extract the textContent of the root node like this:

function html2text($html) {
    $dom = new DOMDocument();
    $dom->loadHTML("<body>" . strip_tags($html, '<b><a><i><div><span><p>') . "</body>");
    $xpath = new DOMXPath($dom);
    $node = $xpath->query('body')->item(0);
    return $node->textContent; // text
}

$p = 'this is <b>test</b>. <p>how are <i>you?</i>. <a href="#">I\'m fine!</a></p>';
print html2text($p);
// this is test. how are you?. I'm fine!

One advantage of this approach is that it does not require any external packages.

score 0 · Answer 15 · answered Mar 07 '23 at 10:30

You can try this, the whole script and demo in one file

$html ="<h1>Hi Sandeep!</h1>
<p>This is some e-mail content in html.
Even though it has whitespace and newlines, the e-mail converter
will handle it correctly.

<p>Even mismatched tags.</p>

<div>A div</div>
<div>Another div</div>
<div>A div<div>within a div</div></div>";

$Html2Text = new Html2Text();
$text = $Html2Text->convert($html);

echo '<pre>'; print_r($text); die();

class Html2Text {

/** @return array<string, bool | string> */
public static function defaultOptions(): array {
    return [
        'ignore_errors' => false,
        'drop_links'    => false,
        'char_set'      => 'auto'
    ];
}

/**
 * Tries to convert the given HTML into a plain text format - best suited for
 * e-mail display, etc.
 *
 * <p>In particular, it tries to maintain the following features:
 * <ul>
 *   <li>Links are maintained, with the 'href' copied over
 *   <li>Information in the &lt;head&gt; is lost
 * </ul>
 *
 * @param string $html the input HTML
 * @param boolean|array<string, bool | string> $options if boolean, Ignore xml parsing errors, else ['ignore_errors' => false, 'drop_links' => false, 'char_set' => 'auto']
 * @return string the HTML converted, as best as possible, to text
 * @throws Html2TextException if the HTML could not be loaded as a {@link \DOMDocument}
 */
public static function convert(string $html, $options = []): string {

    if ($options === false || $options === true) {
        // Using old style (< 1.0) of passing in options
        $options = ['ignore_errors' => $options];
    }

    $options = array_merge(static::defaultOptions(), $options);

    // check all options are valid
    foreach ($options as $key => $value) {
        if (!in_array($key, array_keys(static::defaultOptions()))) {
            throw new \InvalidArgumentException("Unknown html2text option '$key'. Valid options are " . implode(',', static::defaultOptions()));
        }
    }

    $is_office_document = self::isOfficeDocument($html);

    if ($is_office_document) {
        // remove office namespace
        $html = str_replace(["<o:p>", "</o:p>"], "", $html);
    }

    $html = self::fixNewlines($html);

    // use mb_convert_encoding for legacy versions of php
    if (PHP_MAJOR_VERSION * 10 + PHP_MINOR_VERSION < 81 && mb_detect_encoding($html, "UTF-8", true)) {
        $html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
    }

    $doc = self::getDocument($html, $options);

    $output = self::iterateOverNode($doc, null, false, $is_office_document, $options);

    // process output for whitespace/newlines
    $output = self::processWhitespaceNewlines($output);

    return $output;
}

/**
 * Unify newlines; in particular, \r\n becomes \n, and
 * then \r becomes \n. This means that all newlines (Unix, Windows, Mac)
 * all become \ns.
 *
 * @param string $text text with any number of \r, \r\n and \n combinations
 * @return string the fixed text
 */
public static function fixNewlines(string $text): string {
    // replace \r\n to \n
    $text = str_replace("\r\n", "\n", $text);
    // remove \rs
    $text = str_replace("\r", "\n", $text);

    return $text;
}

/** @return array<string> */
public static function nbspCodes(): array {
    return [
        "\xc2\xa0",
        "\u00a0",
    ];
}

/** @return array<string> */
public static function zwnjCodes(): array {
    return [
        "\xe2\x80\x8c",
        "\u200c",
    ];
}

/**
 * Remove leading or trailing spaces and excess empty lines from provided multiline text
 *
 * @param string $text multiline text any number of leading or trailing spaces or excess lines
 * @return string the fixed text
 */
public static function processWhitespaceNewlines(string $text): string {

    // remove excess spaces around tabs
    $text = preg_replace("/ *\t */im", "\t", $text);

    // remove leading whitespace
    $text = ltrim($text);

    // remove leading spaces on each line
    $text = preg_replace("/\n[ \t]*/im", "\n", $text);

    // convert non-breaking spaces to regular spaces to prevent output issues,
    // do it here so they do NOT get removed with other leading spaces, as they
    // are sometimes used for indentation
    $text = self::renderText($text);

    // remove trailing whitespace
    $text = rtrim($text);

    // remove trailing spaces on each line
    $text = preg_replace("/[ \t]*\n/im", "\n", $text);

    // unarmor pre blocks
    $text = self::fixNewLines($text);

    // remove unnecessary empty lines
    $text = preg_replace("/\n\n\n*/im", "\n\n", $text);

    return $text;
}

/**
 * Can we guess that this HTML is generated by Microsoft Office?
 */
public static function isOfficeDocument(string $html): bool {
    return strpos($html, "urn:schemas-microsoft-com:office") !== false;
}

public static function isWhitespace(string $text): bool {
    return strlen(trim(self::renderText($text), "\n\r\t ")) === 0;
}

/**
 * Parse HTML into a DOMDocument
 *
 * @param string $html the input HTML
 * @param array<string, bool | string> $options
 * @return \DOMDocument the parsed document tree
 */
private static function getDocument(string $html, array $options): \DOMDocument {

    $doc = new \DOMDocument();

    $html = trim($html);

    if (!$html) {
        // DOMDocument doesn't support empty value and throws an error
        // Return empty document instead
        return $doc;
    }

    if ($html[0] !== '<') {
        // If HTML does not begin with a tag, we put a body tag around it.
        // If we do not do this, PHP will insert a paragraph tag around
        // the first block of text for some reason which can mess up
        // the newlines. See pre.html test for an example.
        $html = '<body>' . $html . '</body>';
    }

    $header = '';
    // use char sets for modern versions of php
    if (PHP_MAJOR_VERSION * 10 + PHP_MINOR_VERSION >= 81) {
        // use specified char_set, or auto detect if not set
        $char_set = ! empty($options['char_set']) ? $options['char_set'] : 'auto';
        if ('auto' === $char_set) {
            $char_set = mb_detect_encoding($html);
        } else if (strpos($char_set, ',')) {
            mb_detect_order($char_set);
            $char_set = mb_detect_encoding($html);
        }
        // turn off error detection for Windows-1252 legacy html
        if (strpos($char_set, '1252')) {
            $options['ignore_errors'] = true;
        }
        $header = '<?xml version="1.0" encoding="' . $char_set . '">';
    }

    if (! empty($options['ignore_errors'])) {
        $doc->strictErrorChecking = false;
        $doc->recover = true;
        $doc->xmlStandalone = true;
        $old_internal_errors = libxml_use_internal_errors(true);
        $load_result = $doc->loadHTML($header . $html, LIBXML_NOWARNING | LIBXML_NOERROR | LIBXML_NONET | LIBXML_PARSEHUGE);
        libxml_use_internal_errors($old_internal_errors);
    }
    else {
        $load_result = $doc->loadHTML($header . $html);
    }

    if (!$load_result) {
        throw new Html2TextException("Could not load HTML - badly formed?", $html);
    }

    return $doc;
}

/**
 * Replace any special characters with simple text versions, to prevent output issues:
 * - Convert non-breaking spaces to regular spaces; and
 * - Convert zero-width non-joiners to '' (nothing).
 *
 * This is to match our goal of rendering documents as they would be rendered
 * by a browser.
 */
private static function renderText(string $text): string {
    $text = str_replace(self::nbspCodes(), " ", $text);
    $text = str_replace(self::zwnjCodes(), "", $text);
    return $text;
}

private static function nextChildName(?\DOMNode $node): ?string {
    // get the next child
    $nextNode = $node->nextSibling;
    while ($nextNode != null) {
        if ($nextNode instanceof \DOMText) {
            if (!self::isWhitespace($nextNode->wholeText)) {
                break;
            }
        }

        if ($nextNode instanceof \DOMElement) {
            break;
        }

        $nextNode = $nextNode->nextSibling;
    }

    $nextName = null;
    if (($nextNode instanceof \DOMElement || $nextNode instanceof \DOMText) && $nextNode != null) {
        $nextName = strtolower($nextNode->nodeName);
    }

    return $nextName;
}

/** @param array<string, bool | string> $options */
private static function iterateOverNode(\DOMNode $node, ?string $prevName, bool $in_pre, bool $is_office_document, array $options): string {
    if ($node instanceof \DOMText) {
      // Replace whitespace characters with a space (equivalent to \s)
        if ($in_pre) {
            $text = "\n" . trim(self::renderText($node->wholeText), "\n\r\t ") . "\n";

            // Remove trailing whitespace only
            $text = preg_replace("/[ \t]*\n/im", "\n", $text);

            // armor newlines with \r.
            return str_replace("\n", "\r", $text);

        }
        $text = self::renderText($node->wholeText);
        $text = preg_replace("/[\\t\\n\\f\\r ]+/im", " ", $text);

        if (!self::isWhitespace($text) && ($prevName == 'p' || $prevName == 'div')) {
            return "\n" . $text;
        }
        return $text;
    }

    if ($node instanceof \DOMDocumentType || $node instanceof \DOMProcessingInstruction) {
        // ignore
        return "";
    }

    $name = strtolower($node->nodeName);
    $nextName = self::nextChildName($node);

    // start whitespace
    switch ($name) {
        case "hr":
            $prefix = '';
            if ($prevName != null) {
                $prefix = "\n";
            }
            return $prefix . "---------------------------------------------------------------\n";

        case "style":
        case "head":
        case "title":
        case "meta":
        case "script":
            // ignore these tags
            return "";

        case "h1":
        case "h2":
        case "h3":
        case "h4":
        case "h5":
        case "h6":
        case "ol":
        case "ul":
        case "pre":
            // add two newlines
            $output = "\n\n";
            break;

        case "td":
        case "th":
            // add tab char to separate table fields
           $output = "\t";
           break;

        case "p":
            // Microsoft exchange emails often include HTML which, when passed through
            // html2text, results in lots of double line returns everywhere.
            //
            // To fix this, for any p element with a className of `MsoNormal` (the standard
            // classname in any Microsoft export or outlook for a paragraph that behaves
            // like a line return) we skip the first line returns and set the name to br.
            // @phpstan-ignore-next-line
            if ($is_office_document && $node->getAttribute('class') == 'MsoNormal') {
                $output = "";
                $name = 'br';
                break;
            }

            // add two lines
            $output = "\n\n";
            break;

        case "tr":
            // add one line
            $output = "\n";
            break;

        case "div":
            $output = "";
            if ($prevName !== null) {
                // add one line
                $output .= "\n";
            }
            break;

        case "li":
            $output = "- ";
            break;

        default:
            // print out contents of unknown tags
            $output = "";
            break;
    }

    // debug
    //$output .= "[$name,$nextName]";

    if (isset($node->childNodes)) {

        $n = $node->childNodes->item(0);
        $previousSiblingNames = [];
        $previousSiblingName = null;

        $parts = [];
        $trailing_whitespace = 0;

        while ($n != null) {

            $text = self::iterateOverNode($n, $previousSiblingName, $in_pre || $name == 'pre', $is_office_document, $options);

            // Pass current node name to next child, as previousSibling does not appear to get populated
            if ($n instanceof \DOMDocumentType
                || $n instanceof \DOMProcessingInstruction
                || ($n instanceof \DOMText && self::isWhitespace($text))) {
                // Keep current previousSiblingName, these are invisible
                $trailing_whitespace++;
            }
            else {
                $previousSiblingName = strtolower($n->nodeName);
                $previousSiblingNames[] = $previousSiblingName;
                $trailing_whitespace = 0;
            }

            $node->removeChild($n);
            $n = $node->childNodes->item(0);

            $parts[] = $text;
        }

        // Remove trailing whitespace, important for the br check below
        while ($trailing_whitespace-- > 0) {
            array_pop($parts);
        }

        // suppress the last br tag in a node list if it directly follows text
        $last_name = array_pop($previousSiblingNames);
        if ($last_name === 'br') {
            $last_name = array_pop($previousSiblingNames);
            if ($last_name === '#text') {
                array_pop($parts);
            }
        }

        $output .= implode('', $parts);
    }

    // end whitespace
    switch ($name) {
        case "h1":
        case "h2":
        case "h3":
        case "h4":
        case "h5":
        case "h6":
        case "pre":
        case "p":
            // add two lines
            $output .= "\n\n";
            break;

        case "br":
            // add one line
            $output .= "\n";
            break;

        case "div":
            break;

        case "a":
            // links are returned in [text](link) format
            // @phpstan-ignore-next-line
            $href = $node->getAttribute("href");

            $output = trim($output);

            // remove double [[ ]] s from linking images
            if (substr($output, 0, 1) == "[" && substr($output, -1) == "]") {
                $output = substr($output, 1, strlen($output) - 2);

                // for linking images, the title of the <a> overrides the title of the <img>
                // @phpstan-ignore-next-line
                if ($node->getAttribute("title")) {
                    // @phpstan-ignore-next-line
                    $output = $node->getAttribute("title");
                }
            }

            // if there is no link text, but a title attr
            // @phpstan-ignore-next-line
            if (!$output && $node->getAttribute("title")) {
                // @phpstan-ignore-next-line
                $output = $node->getAttribute("title");
            }

            if ($href == null) {
                // it doesn't link anywhere
                // @phpstan-ignore-next-line
                if ($node->getAttribute("name") != null) {
                    if (!$options['drop_links']) {
                        // named anchor: wrap the text in brackets
                        $output = "[$output]";
                    }
                }
            } else {
                if ($href == $output || $href == "mailto:$output" || $href == "http://$output" || $href == "https://$output") {
                    // link to the same address: just use link
                    $output = "$output";
                } else {
                    // replace it
                    if ($output) {
                        if ($options['drop_links']) {
                            $output = "$output";
                        } else {
                            $output = "[$output]($href)";
                        }
                    } else {
                        // empty string
                        $output = "$href";
                    }
                }
            }

            // does the next node require additional whitespace?
            switch ($nextName) {
                case "h1": case "h2": case "h3": case "h4": case "h5": case "h6":
                    $output .= "\n";
                    break;
            }
            break;

        case "img":
            // @phpstan-ignore-next-line
            if ($node->getAttribute("title")) {
                // @phpstan-ignore-next-line
                $output = "[" . $node->getAttribute("title") . "]";
            // @phpstan-ignore-next-line
            } elseif ($node->getAttribute("alt")) {
                // @phpstan-ignore-next-line
                $output = "[" . $node->getAttribute("alt") . "]";
            } else {
                $output = "";
            }
            break;

        case "li":
            $output .= "\n";
            break;

        case "blockquote":
            // process quoted text for whitespace/newlines
            $output = self::processWhitespaceNewlines($output);

            // add leading newline
            $output = "\n" . $output;

            // prepend '> ' at the beginning of all lines
            $output = preg_replace("/\n/im", "\n> ", $output);

            // replace leading '> >' with '>>'
            $output = preg_replace("/\n> >/im", "\n>>", $output);

            // add another leading newline and trailing newlines
            $output = "\n" . $output . "\n\n";
            break;
        default:
            // do nothing
    }

    return $output;
}

}
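
To make the link and blockquote rules above easier to follow, here is a minimal standalone sketch of the same ideas. The function names `formatLink` and `quoteLines` are hypothetical and are not part of the class. The sketch assumes the anchor has an `href` and ignores the `title` and `name` attribute handling, so treat it as an illustration rather than a replacement.

```php
<?php
// Hypothetical helper (not part of the class above): restates the rules the
// "a" case applies to an anchor that has an href attribute.
function formatLink(string $text, string $href, bool $dropLinks = false): string
{
    $text = trim($text);

    // No visible link text: fall back to the bare address.
    if ($text === '') {
        return $href;
    }

    // The text already is the address (with or without a scheme): print it once.
    $sameAddress = [$text, "mailto:$text", "http://$text", "https://$text"];
    if (in_array($href, $sameAddress, true)) {
        return $text;
    }

    // Otherwise use [text](link), or just the text when links are dropped.
    return $dropLinks ? $text : "[$text]($href)";
}

// Hypothetical helper: prefix every line with "> ", as the blockquote case does.
function quoteLines(string $text): string
{
    // Add a leading newline so the first line is prefixed too.
    $quoted = preg_replace("/\n/", "\n> ", "\n" . $text);

    // Collapse nested quotes: "> >" becomes ">>".
    $quoted = preg_replace("/\n> >/", "\n>>", $quoted);

    return ltrim($quoted, "\n");
}

echo formatLink('Example', 'http://example.com'), "\n";
echo quoteLines("first line\nsecond line"), "\n";
```

With `$dropLinks` set to `true`, only the link text is kept, which mirrors the `drop_links` option checked in the class.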
