9

Do you know any good HTML to plain text conversion class written in PHP?

I need it for converting HTML mail body to plain text mail body.

I wrote simple function, but I need more features like converting tables, adding links at the end, converting nested lists…

-- regards
takeshin

hakre
  • 193,403
  • 52
  • 435
  • 836
takeshin
  • 49,108
  • 32
  • 120
  • 164
  • Why not just send HTML mail? I understand that faking tables is kind of possible in plaintext but every email reader in the world reads HTML, why not save yourself the trouble of pointless conversion because you or somebody else refuses to use HTML mail. – TravisO Dec 18 '09 at 19:58
  • 5
    TravisO: Not every reader. And some do not automatically convert HTML into plain text. For a user the raw HTML is usually not nice to read :-) – Joey Dec 18 '09 at 20:07
  • 1996 is over, get use to it. But of course the elitist types who loathe HTML email are going to be the most vocal/willing to vote those ideals up. – TravisO Dec 18 '09 at 21:00
  • 3
    There is a lot of people who do not like to read fancy emails. Have you seen your HTML email on old phones? – takeshin Dec 20 '09 at 13:19
  • Inside the limesurvey project you'll find a chunk of code that is working together with the PHPMailer class to create alt-bodies of HTML email templates. As far as I know it's pretty well tailored for commercial grade single HTML templates that can be written in a way that the plain text variants don't look like shit. If you compare that as well with the javacsript part that is the integration of the ckeditor component, you should be able to create great templates in no time. – hakre Jul 04 '12 at 00:19
  • Possible duplicate of [Converting HTML to plain text in PHP for e-mail](http://stackoverflow.com/questions/1884550/converting-html-to-plain-text-in-php-for-e-mail) – Cullub Jan 20 '16 at 14:47
  • 1996 is over,... and now we have smartwatches. If you don't include clean alternate text in your email, users will see an ugly mess [source Litmus report on email status 2016, chapter about the iWatch] Welcome to 2016, @TravisO! https://litmus.com/lp/2016-state-of-email-report – FrancescoMM Jan 31 '16 at 11:03

7 Answers7

6

I'd suggest using a HTML to Markdown converter.

Saugat
  • 1,309
  • 14
  • 22
ceejayoz
  • 176,543
  • 40
  • 303
  • 368
  • 1
    Uh, have you used or read anything about Markdown? "The overriding design goal for Markdown’s formatting syntax is to make it as readable as possible. **The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions.**" – ceejayoz Dec 18 '09 at 23:55
  • 2
    Markdownify is a good solution, indeed. I looked at it before, but I thought that it does not converts tables. But the problem was, that I tried on tables with `` attributies and some css styles. I stripped manually captions and class and style attributies, and it works nice. – takeshin Dec 20 '09 at 13:15
5

A particular mail sending implementation around here simply spawns lynx with the HTML and uses its output for the text version. It's not perfect but works. You might also use links or elinks.

Joey
  • 344,408
  • 85
  • 689
  • 683
  • Yes, this was already suggested on StackOverflow, but I was asking for PHP soultion. I do not have access to lynx to my server. Thanks. – takeshin Dec 20 '09 at 13:12
  • You forgot to mention that you need the `-dump` arg to lynx – JoelFan Apr 13 '11 at 03:19
  • This generally works very good, you can normally use it with a good, content centric web-design out-of-the-box. Since ages. +1 – hakre Jul 04 '12 at 00:22
3

Using lynx is an option only if you have permission to run executables on the server. Doing so, however, is not considered a good practice. Furthermore, in secure hosts the php process is limited to be unable to spawn bash sessions, which are required for running lynx.

The most complete solution written entirely in PHP I was able to find is the Horde_Text_Filter_Html2text class. It is a part from the Horde framework.

Other solutions I've tried include:

If someone got the perfect solution, please, post it back for further reference!

Nikola Petkanski
  • 4,724
  • 1
  • 33
  • 41
3

You can use lynx with -stdin and -dump options to achieve that:

<?php
$descriptorspec = array(
   0 => array("pipe", "r"),  // stdin is a pipe that the child will read from
   1 => array("pipe", "w"),  // stdout is a pipe that the child will write to
   2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to
);

$process = proc_open('lynx -stdin -dump 2>&1', $descriptorspec, $pipes, '/tmp', NULL);

if (is_resource($process)) {
    // $pipes now looks like this:
    // 0 => writeable handle connected to child stdin
    // 1 => readable handle connected to child stdout
    // Any error output will be appended to htmp2txt.log

    $stdin = $pipes[0];
    fwrite($stdin,  <<<'EOT'
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
 <title>TEST</title>
</head>
<body>
<h1><span>Lorem Ipsum</span></h1>

<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4>
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis.
</p>
<p>
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui.
</p>
</body>
</html>
EOT
    );
    fclose($stdin);

    echo stream_get_contents($pipes[1]);
    fclose($pipes[1]);

    // It is important that you close any pipes before calling
    // proc_close in order to avoid a deadlock
    $return_value = proc_close($process);

    echo "command returned $return_value\n";
}
nad2000
  • 4,526
  • 1
  • 30
  • 25
2

As the question is about PHP and I found Dharmesh Hadiyal's c# code quite useful, I have converted it to PHP.

(can't comment, not enough reputation)

//From https://stackoverflow.com/questions/1930297/html-to-plain-text-for-email/23988241#23988241
//converted from c# to PHP
class HtmlToText
{
    public static function stripHTML($source)
    {
        // Remove HTML Development formatting
        // Replace line breaks with space
        // because browsers inserts space
        $result = str_replace("\r",  " ",$source );
        // Replace line breaks with space
        // because browsers inserts space
        $result = str_replace("\n",  " ",$result );
        // Remove step-formatting
        $result = str_replace("\t",  "",$result );
        // Remove repeating spaces because browsers ignore them
        $result = preg_replace("/( )+/im",  " ", $result);


        // Remove html-Tag (prepare first by clearing attributes)
        $result = preg_replace("/<( )*html([^>])*>\s*/im",  "<html>", $result);

        $result = preg_replace("/(<( )*(\/)( )*html( )*>)/im",  "</html>", $result);

        $result = preg_replace("/(<html>)|(<\/html>)/im",  "", $result);

        // Remove the header (prepare first by clearing attributes)
        $result = preg_replace("/<( )*head([^>])*>/im",  "<head>", $result);

        $result = preg_replace("/(<( )*(\/)( )*head( )*>)/im",  "</head>", $result);

        $result = preg_replace("/(<head>).*(<\/head>)/im",  "", $result);


        // remove all scripts (prepare first by clearing attributes)
        $result = preg_replace("/<( )*script([^>])*>/im",  "<script>", $result);

        $result = preg_replace("/(<( )*(\/)( )*script( )*>)/im",  "</script>", $result);

        //$result = System.Text.RegularExpressions.Regex.Replace($result,
        //         "(<script>)([^(<script>\.</script>)])*(</script>)",
        //         "",
        //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        $result = preg_replace("/(<script>).*(<\/script>)/im",  "", $result);


        // remove all styles (prepare first by clearing attributes)
        $result = preg_replace("/<( )*style([^>])*>/im",  "<style>", $result);

        $result = preg_replace("/(<( )*(\/)( )*style( )*>)/im",  "</style>", $result);

        $result = preg_replace("/(<style>).*(<\/style>)/im",  "", $result);


        // insert tabs in spaces of <td> tags
        $result = preg_replace("/<( )*td([^>])*>/im",  "\t", $result);


        // insert line breaks in places of <BR> and <LI> tags
        $result = preg_replace("/<( )*br( )*\/?>/im",  "\r", $result);

        $result = preg_replace("/<( )*li( )*>/im",  "\r", $result);


        // insert line paragraphs (double line breaks) in place
        // if <P>, <DIV> and <TR> tags
        $result = preg_replace("/<( )*div([^>])*>/im",  "\r\r", $result);

        $result = preg_replace("/<( )*tr([^>])*>/im",  "\r\r", $result);

        $result = preg_replace("/<( )*p([^>])*>/im",  "\r\r", $result);


        // Remove remaining tags like <a>, links, images,
        // comments etc - anything that's enclosed inside < >
        $result = preg_replace("/<[^>]*>/im",  "", $result);


        // replace special characters:
        $result = preg_replace("/ /im",  " ", $result);


        $result = preg_replace("/&bull;/im",  " * ", $result);

        $result = preg_replace("/&lsaquo;/im",  "<", $result);

        $result = preg_replace("/&rsaquo;/im",  ">", $result);

        $result = preg_replace("/&trade;/im",  "(tm)", $result);

        $result = preg_replace("/&frasl;/im",  "/", $result);

        $result = preg_replace("/&lt;/im",  "<", $result);

        $result = preg_replace("/&gt;/im",  ">", $result);

        $result = preg_replace("/&copy;/im",  "(c)", $result);

        $result = preg_replace("/&reg;/im",  "(r)", $result);

        // Remove all others. More can be added, see
        // http://hotwired.lycos.com/webmonkey/reference/special_characters/
        $result = preg_replace("/&(.{2,6});/im",  "", $result);


        // for testing
        //System.Text.RegularExpressions.Regex.Replace($result,
        //       this.txtRegex.Text,"",
            //       System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // make line breaking consistent
        $result = str_replace("\n",  "\r",$result );

        // Remove extra line breaks and tabs:
        // replace over 2 breaks with 2 and over 4 tabs with 4.
        // Prepare first to remove any whitespaces in between
        // the escaped characters and remove redundant tabs in between line breaks
        $result = preg_replace("/(\r)( )+(\r)/im",  "\r\r", $result);

        $result = preg_replace("/(\t)( )+(\t)/im",  "\t\t", $result);

        $result = preg_replace("/(\t)( )+(\r)/im",  "\t\r", $result);

        $result = preg_replace("/(\r)( )+(\t)/im",  "\r\t", $result);

        // Remove redundant tabs
        $result = preg_replace("/(\r)(\t)+(\r)/im",  "\r\r", $result);

        // Remove multiple tabs following a line break with just one tab
        $result = preg_replace("/(\r)(\t)+/im",  "\r\t", $result);

        // Initial replacement target string for line breaks
        $breaks = "\r\r\r";
        // Initial replacement target string for tabs
        $tabs = "\t\t\t\t\t";
        for ($index = 0; $index < strlen($result); $index++)
        {
            $result = str_replace($breaks,  "\r\r",$result );
            $result = str_replace($tabs,  "\t\t\t\t",$result );
            $breaks = $breaks . "\r";
            $tabs = $tabs . "\t";
        }

        //remove spaces at the beginning of a line
        $result = preg_replace("/^ +/im",  "", $result);

        //line breaks at the beginning/end is probably unwanted. Coluld be left over by removing <html>/<head>/<body>
        $result = trim($result);

        // That's it.
        return $result;
    }
}
1

In c# :

private string StripHTML(string source)
{
    try
    {
        string result;

        // Remove HTML Development formatting
        // Replace line breaks with space
        // because browsers inserts space
        result = source.Replace("\r", " ");
        // Replace line breaks with space
        // because browsers inserts space
        result = result.Replace("\n", " ");
        // Remove step-formatting
        result = result.Replace("\t", string.Empty);
        // Remove repeating spaces because browsers ignore them
        result = System.Text.RegularExpressions.Regex.Replace(result,
                                                              @"( )+", " ");

        // Remove the header (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*head([^>])*>", "<head>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*head( )*>)", "</head>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(<head>).*(</head>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // remove all scripts (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*script([^>])*>", "<script>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*script( )*>)", "</script>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        //result = System.Text.RegularExpressions.Regex.Replace(result,
        //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
        //         string.Empty,
        //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<script>).*(</script>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // remove all styles (prepare first by clearing attributes)
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*style([^>])*>", "<style>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"(<( )*(/)( )*style( )*>)", "</style>",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(<style>).*(</style>)", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // insert tabs in spaces of <td> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*td([^>])*>", "\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // insert line breaks in places of <BR> and <LI> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*br( )*>", "\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*li( )*>", "\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // insert line paragraphs (double line breaks) in place
        // if <P>, <DIV> and <TR> tags
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*div([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*tr([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<( )*p([^>])*>", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // Remove remaining tags like <a>, links, images,
        // comments etc - anything that's enclosed inside < >
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"<[^>]*>", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // replace special characters:
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @" ", " ",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&bull;", " * ",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&lsaquo;", "<",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&rsaquo;", ">",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&trade;", "(tm)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&frasl;", "/",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&lt;", "<",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&gt;", ">",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&copy;", "(c)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&reg;", "(r)",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Remove all others. More can be added, see
        // http://hotwired.lycos.com/webmonkey/reference/special_characters/
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 @"&(.{2,6});", string.Empty,
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // for testing
        //System.Text.RegularExpressions.Regex.Replace(result,
        //       this.txtRegex.Text,string.Empty,
        //       System.Text.RegularExpressions.RegexOptions.IgnoreCase);

        // make line breaking consistent
        result = result.Replace("\n", "\r");

        // Remove extra line breaks and tabs:
        // replace over 2 breaks with 2 and over 4 tabs with 4.
        // Prepare first to remove any whitespaces in between
        // the escaped characters and remove redundant tabs in between line breaks
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)( )+(\r)", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t)( )+(\t)", "\t\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\t)( )+(\r)", "\t\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)( )+(\t)", "\r\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Remove redundant tabs
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)(\t)+(\r)", "\r\r",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Remove multiple tabs following a line break with just one tab
        result = System.Text.RegularExpressions.Regex.Replace(result,
                 "(\r)(\t)+", "\r\t",
                 System.Text.RegularExpressions.RegexOptions.IgnoreCase);
        // Initial replacement target string for line breaks
        string breaks = "\r\r\r";
        // Initial replacement target string for tabs
        string tabs = "\t\t\t\t\t";
        for (int index = 0; index < result.Length; index++)
        {
            result = result.Replace(breaks, "\r\r");
            result = result.Replace(tabs, "\t\t\t\t");
            breaks = breaks + "\r";
            tabs = tabs + "\t";
        }

        // That's it.
        return result;
    }
    catch
    {
        MessageBox.Show("Error");
        return source;
    }
}
Dharmesh Hadiyal
  • 719
  • 1
  • 7
  • 18
1

I know the question is about PHP, but I used the lynx idea to make this Perl subroutine to convert HTML to text:

use File::Temp;

sub html2Txt {
    my $html = shift;
    my $htmlF = File::Temp->new(SUFFIX => '.html');
    print $htmlF $html;
    close $htmlF;
    return scalar `/usr/bin/lynx -dump $htmlF 2> /dev/null`;
}

print html2Txt '<b>Hi there</b> Testing';

prints: Hi there Testing

JoelFan
  • 37,465
  • 35
  • 132
  • 205