I'm in the process of creating a script for our internal customer support system. I want to collect emails from our IMAP inbox (hosted on Gmail) and parse the emails into the database.

What is the best way to strip frames, badly coded tags, and messy formatting so that the result is clean text with minimal formatting?

I'm aware regular expressions will most likely play a large part, but I want to know whether this functionality already exists in a library that I'm missing.

Edit: More specifically, what needs to be removed:

All inline CSS/styling, and all HTML except simple formatting like bold, underline, and italics.

Here's an email I'm using as a test case. It's a fairly beefy spam email I got from ZoneAlarm, and it's got a bit of everything.

<td>
                    <br>
                    <br>


                    <table align="center" bgcolor="#749FD0" border="0" cellpadding="0" cellspacing="0" style="font-family:Arial,Helvetica,sans-serif;font-size:12px;line-height:16px;color:#555555" valign="top" width="700">
                        <tbody>
                            <tr>
                                <td>

                                    <table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
                                        <tbody>
                                            <tr>
                                                <td height="10">
                                                    <img border="0" height="1" src="http://download.zonealarm.com/bin/images/email/socialguard/spacer.gif" style="display: block; max-width: 2880px;" width="1"></td>
                                            </tr>
                                        </tbody>
                                    </table>
                                    <table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
                                        <tbody>
                                            <tr>
                                                <td height="10" width="10">
                                                    <img border="0" height="10" src="http://www.zonealarm.com/email/campaigns/2013/2013_06_SummerSale/nw.png" style="display: block; max-width: 2880px;" width="10"></td>
                                                <td bgcolor="#E3ECEC" height="10" width="660">

                                                    <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2000&amp;&amp;&amp;http://www.zonealarm.com?cid=E200246" target="_blank"><img alt="ZoneAlarm by Check Point Software Technologies LTD." border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/za_transparent.png" width="120" style="display: block; max-width: 2880px;" title="ZoneAlarm by Check Point Software Technologies LTD."></a></td>
                                                <td align="right" style="font-family:Arial,Helvetica,sans-serif" width="150">
                                                    <span style="color:#999999;font-size:12px">Connect with ZoneAlarm</span></td>
                                                <td align="right" valign="middle" width="125">
                                                    <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2001&amp;&amp;&amp;http://www.facebook.com/ZoneAlarmFirewall" target="_blank"><img alt="ZoneAlarm Facebook" border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/facebook.png" width="22" title="ZoneAlarm Facebook" style="max-width: 2880px;"></a> <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2002&amp;&amp;&amp;http://twitter.com/zonealarm" target="_blank"><img alt="ZoneAlarm Twitter" border="0" width="22" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/twitter.png" title="ZoneAlarm Twitter" style="max-width: 2880px;"></a> <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2003&amp;&amp;&amp;http://www.youtube.com/zonealarmsecurity" target="_blank"><img alt="ZoneAlarm YouTube" border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/youtube.png" title="ZoneAlarm YouTube" height="22" style="max-width: 2880px;"></a><img border="0" height="15" src="http://download.zonealarm.com/bin/images/email/socialguard/spacer.gif" width="10" style="max-width: 2880px;"></td>
                                                    <td bgcolor="#E3ECEC" rowspan="6" align="center" valign="top" width="1">
                                                <img align="right" height="32" src="http://download.zonealarm.com/bin/images/emails/welcome/borderx1.png" width="1" style="max-width: 2880px;">
                                                    </td>
                                            </tr>
                                        </tbody>
                                    </table>
                                    <table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
                                        <tbody>
                                            <tr>
                                                <td height="10" width="10">
                                                    <img border="0" height="10" src="http://www.zonealarm.com/email/campaigns/2013/2013_06_SummerSale/sw.png" style="display: block; max-width: 2880px;" width="10"></td>
                                                <td bgcolor="#E3ECEC" height="10" width="660">

Mike McCormick

1 Answer

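You can use HTML Purifier for this. It is a PHP library that filters HTML against a whitelist of allowed elements and attributes, so you can keep simple formatting such as bold, underline, and italics and drop everything else. See: http://htmlpurifier.org/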
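
To make the whitelist idea concrete, here is a minimal sketch in Python using only the standard library's `html.parser`. This is not HTML Purifier and not its API; it is a swapped-in illustration of the same technique. The names `ALLOWED_TAGS`, `SKIPPED_TAGS`, `SimpleFormattingFilter`, and `clean_html` are hypothetical, and the choice of which tags to keep is an assumption based on the question (bold, underline, italics).

```python
from html import escape
from html.parser import HTMLParser

# Assumption: only these simple formatting tags survive.
ALLOWED_TAGS = {"b", "strong", "i", "em", "u"}
# Elements whose text content is discarded along with the tag.
SKIPPED_TAGS = {"script", "style"}


class SimpleFormattingFilter(HTMLParser):
    """Keeps text and allowed tags; drops all other tags and all attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            # Attributes (style, onclick, ...) are never copied over.
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text so stray "<" or "&" cannot form new markup.
            self.parts.append(escape(data, quote=False))


def clean_html(dirty):
    """Return text with only whitelisted tags and collapsed whitespace."""
    parser = SimpleFormattingFilter()
    parser.feed(dirty)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = (
    '<td align="right" style="font-family:Arial">'
    '<span style="color:#999999;font-size:12px">'
    "Connect with <b>ZoneAlarm</b></span></td>"
)
print(clean_html(sample))
```

Unlike a dedicated sanitizer, this sketch does not repair unbalanced tags or preserve line breaks, so treat it as a starting point rather than a replacement for a maintained library.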

Marco Veenendaal