Why does PHP's preg_match work differently on strings vs. string literals when extracting data into a named array?

Question

I'm trying to extract the six fields (Sender, Customer ID and so on) from the body of an HTML email:

$string = '... some other html text ... <p>
   <strong>Sender:</strong>&nbsp;Holly Schöne<br>
   <strong>Customer ID:</strong>&nbsp;3853XXXX<br>
   <strong>Email:</strong>&nbsp;email@test.net<br>
   <strong>Transaction ID:</strong>&nbsp;836248467<br>
   <strong>Reference:</strong>&nbsp;product<br>
   <strong>Explanation:</strong>&nbsp;Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
</p>... some more html text ...';

... which I obtain like this:

$message = imap_fetchbody($inbox, $email_number, $section);
// determine $encoding and $charset
$decodedMessage = decodeMessage($message, $encoding, $charset);

using this function (the cases for the other encodings are left out, since nothing is done for them):

function decodeMessage($message, $encoding, $charset) {
    switch ($encoding) {
        case 3: // BASE64
            $message = base64_decode($message);
            break;
        case 4: // QUOTED-PRINTABLE
            $message = quoted_printable_decode($message);
            break;
        default:
            break;
    }
    if ($charset != NULL) {
        $message = mb_convert_encoding($message , 'utf-8' , $charset);
        //$message = mb_convert_encoding($message , 'iso-8859-1' , $charset);
    }
    return $message;
}

That all works like a charm. The problem starts here:

$regex = '/\<p\>[\w\W. ]*?\<strong\>Sender\:\<\/strong\>&nbsp;(?<sender>[\w\W ]+?)\<br\>.*?\<strong\>Customer ID\:\<\/strong\>&nbsp;(?<customerId>[\w\W ]+?)\<br\>.*?\<strong\>Email\:\<\/strong\>&nbsp;(?<email>[\w\W ]+?)\<br\>.*?\<strong\>Transaction ID\:\<\/strong\>&nbsp;(?<transactionId>[\w\W ]+?)\<br\>.*?\<strong\>Reference\:\<\/strong\>&nbsp;(?<reference>[\w\W ]+?)\<br\>.*?\<strong\>Explanation\:\<\/strong\>&nbsp;(?<explanation>[\w\W ]+?)\<\/p\>/is';
$result = preg_match($regex, $decodedMessage, $matches);

If I apply that regex to the string above I get exactly what I wanted - an array like this:

print_r($matches) = Array (
    [0] => <p>
       <strong>Sender:</strong>&nbsp;Holly SchÃ¶ne<br>
       <strong>Customer ID:</strong>&nbsp;3853XXXX<br>
       <strong>Email:</strong>&nbsp;email@test.net<br>
       <strong>Transaction ID:</strong>&nbsp;836248467<br>
       <strong>Reference:</strong>&nbsp;product<br>
       <strong>Explanation:</strong>&nbsp;Holly SchÃ¶ne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
    </p>
    [sender] => Holly SchÃ¶ne
    [1] => Holly SchÃ¶ne
    [customerId] => 3853XXXX
    [2] => 3853XXXX
    [email] => email@test.net
    [3] => email@test.net
    [transactionId] => 836248467
    [4] => 836248467
    [reference] => product
    [5] => product
    [explanation] => Holly SchÃ¶ne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
    [6] => Holly SchÃ¶ne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
)

... however, if I do the same with $decodedMessage I get:

preg_last_error() -> PREG_NO_ERROR
$result -> [empty string]
$matches -> array()

I tried everything and looked around but I just can't figure out the problem. My guess is it has to do with the encoding or the character set of the email body. Any help would be greatly appreciated.

Well, you asked for it. I thought this question was already very long, but here is the var_dump; I only changed some personal details.

... ah damn ... and there was my problem as well.

I let myself be fooled by Waterfox's source code viewer.

It showed <br /> as <br> and added a <tbody> to each table, so the source code I based my regex on was not what the email actually contained. I feel pretty foolish now. The actual HTML source is below:

<html>
<table width="750" cellpadding="0" cellspacing="0">
    <tr>
        <td style="background-repeat:no-repeat;" background="http://i1.mbsvr.net/images/bg_mailframe.gif" width="100%" align="center">
            <table width="95%" align="center">
                <tr>
                    <td align="left" style="padding:10px 0 0 10px;">
                        <a href="http://www.moneybookers.com/app/?l=EN" target="_blank" style="color:FD932C;font-weight:normal;" onfocus="this.blur()">
                            <img src="http://i1.mbsvr.net/images/skrill/mb-logo-the-future.png" border="0" />
                        </a>
                    </td>
                </tr>
            </table>
            <table width="740">
                <tr><td style="padding:0px 40px 0px 0px" align="center">
<table width="100%" border="0" cellpadding="0" cellspacing="0">
    <tr>
        <td valign="top" align="middle">
            <table cellspacing="0" cellpadding="0" width="100%" border="0">
                <tr>
                    <td>
                        <hr style="!important; font-family: verdana, arial, sans-serif; border: 0; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;" />
                    </td>
                </tr>
                <tr> 
                    <td style="!important; font-family: verdana, arial, sans-serif; margin: 0; padding: 0px 0px 10px 0px; color: #EF8116; font-weight: bold; font-size: 18px;" nowrap width="50%">
                        You have received EUR 0.05
                    </td>
                </tr>
                <tr>
                    <td style="!important; font-family: verdana, arial, sans-serif;  font-size: 11px;   color: #656565;">
                        <br/> 
                        Dear Mmmmmmm Bbbbbbb,<br />
                        <br/>
                        Holly Schöne has sent you EUR 0.05 via Skrill (Moneybookers). The full details of the transaction are:<br />
                        <p>
                            <strong>Sender:</strong> Holly Schöne<br />
                            <strong>Customer ID:</strong> 3853XXXX<br />
                            <strong>Email:</strong> email@test.net<br />
                            <strong>Transaction ID:</strong> 836151721<br />
                            <strong>Reference:</strong> TPBwishes<br />
                            <strong>Explanation:</strong> Holly Schoene
#gsg4sda65g4r65e4g8s4g56asd54e#
                        </p>
                        Your money is waiting for you in your Skrill (Moneybookers) account - <a href="https://www.moneybookers.com">https://www.moneybookers.com</a>.<br />
                        <br />
                        <b>IMPORTANT:</b> If you are using Skrill (Moneybookers) commercially, we <b>STRONGLY</b> advise that you check in your Skrill (Moneybookers) account history that the money is there.<br />
                        <br />
                        Have you increased your withdrawal and receiving limits? Just log into your Skrill (Moneybookers) account and click <b>View Limits</b> in the "My Account" section.<br />
                        <br />
                        Kind regards,<br />
                        Skrill (Moneybookers)<br />
                    </td>
                </tr>
                <tr>
                    <td>
                        <hr style="!important; font-family: verdana, arial, sans-serif; border: 0; margin: 8px 0px 0px 0px; padding: 6px 0px 0px 0px; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;" />
                    </td>
                </tr>
            </table>
            <table cellspacing=0 cellpadding=0 width="100%" border=0> 
    <tr>
        <td style="font-family: verdana, arial, sans-serif; font-size: 12px;    color: #656565;"><b>Skrill (Moneybookers) Security Reminders</b></td>
    </tr>
       <tr>          

<td class=smooth valign="top" style="font-family: verdana, arial, sans-serif; font-size: 11px;  color: #656565;"><p> <br>              <strong>Protect Your Password</strong><br>Skrill (Moneybookers) and its representatives will NEVER ask you to reveal your password. There are NO EXCEPTIONS to this policy. If anyone asks for your password by phone or by email, or on any website other than moneybookers.com, refuse and immediately report this to <a href="mailto:security@moneybookers.com" style="color: #862165; text-decoration: none; outline: none !important; font-weight: bold;">security@moneybookers.com</a>.<br><br><strong>Access your account ONLY using the login link on the Moneybookers homepage</strong><br>Please be advised that Skrill (Moneybookers) and its representatives will NEVER send you an email asking you to provide your login details within a form provided or to click on a hyperlink to access your account! Immediately report any incident to <a href="mailto:security@moneybookers.com"              style="color: #862165; text-decoration: none; outline: none !important; font-weight: bold;">security@moneybookers.com</a>.<br><br><strong>Case Sensitive Login</strong><br>Please remember your password is case-sensitive, at least 8 characters long and contains at least one number or non-alphabetic character such as '-'. <br>              <br>            </p></td>                </tr>      </table>
        </td>
    </tr>
</table>                </td></tr>
                <tr>
                    <td style="padding:0px 54px 0px 0px" class="separator"><hr style="border: 0; margin: 8px 0px 0px 0px; padding: 6px 0px 0px 0px; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;"/></td>
                </tr>
            </table>
            <table align="left" width="740">
                <tr>
                    <td width="10"> </td>
                    <td style="font-family: verdana, arial, sans-serif; font-size: 11px;    color: #656565;" valign="top" width="100%" align="center">
                    Moneybookers Ltd., London, Registered in England and Wales no 4260907.<br>
Registered office: Welken House, 10-11 Charterhouse Square, London, EC1M 6EH, United Kingdom.<br>
Authorised by the Financial Services Authority (FSA) under the Electronic Money Regulations 2011 for the issuing of electronic money.
                    </td>
                </tr>
            </table>
        </td>
    <tr>
        <td valign="top">
            <img src="http://i1.mbsvr.net/images/bg_mailframe_bottom.gif" border="0" />
        </td>
    </tr>
</table>
</html>
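
The mismatch described above can be reproduced in isolation. The following is only a minimal sketch: the two one-field fragments and the tolerant pattern are illustrative assumptions, not the original email or the regex from the question. It shows that a pattern written against <br> and &nbsp; simply fails to match markup that uses <br /> and a plain space, and that a failed match is not reported as an error by preg_last_error().

```php
<?php
// Hypothetical one-field fragments: $viewed mimics what the browser's source
// view showed, $actual mimics the markup the email really contained.
$viewed = '<p><strong>Sender:</strong>&nbsp;Holly Schoene<br></p>';
$actual = '<p><strong>Sender:</strong> Holly Schoene<br /></p>';

// Strict pattern written against the viewed markup: literal &nbsp; and <br>.
$strict = '/<strong>Sender:<\/strong>&nbsp;(?<sender>.+?)<br>/is';

// Tolerant pattern (an assumption, not the question's regex): optional
// &nbsp; or whitespace after the label, and <br>, <br/> or <br /> at the end.
$tolerant = '/<strong>Sender:<\/strong>(?:&nbsp;|\s)*(?<sender>.+?)<br\s*\/?>/is';

$strictOnViewed   = preg_match($strict, $viewed, $m1);   // matches
$strictOnActual   = preg_match($strict, $actual, $m2);   // no match, no error
$errorAfterMiss   = preg_last_error();                   // state after the miss
$tolerantOnActual = preg_match($tolerant, $actual, $m3); // matches

var_dump($strictOnViewed, $strictOnActual, $errorAfterMiss === PREG_NO_ERROR);
var_dump($tolerantOnActual, $m3['sender']);
```

Comparing the return value with === 0 (no match) and === false (error) helps tell a pattern that merely does not fit the input from a genuine PCRE failure.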

So together with Tomalak's answer I now have two working solutions:

My now-working regex, which accepts correctly closed <br /> tags and now also parses the value:

$regex = '/<td .*?>.*?You have received(?<value>.+?\d+\.\d\d).*?<\/td>.*?<p>.*?<strong>Sender:<\/strong>(?<sender>.+?)<br\s*\/?>.*?<strong>Customer ID:<\/strong>(?<customerId>.+?)<br\s*\/?>.*?<strong>Email:<\/strong>(?<email>.+?)<br\s*\/?>.*?<strong>Transaction ID:<\/strong>(?<transactionId>.+?)<br\s*\/?>.*?<strong>Reference:<\/strong>(?<reference>.+?)<br\s*\/?>.*?<strong>Explanation:<\/strong>(?<explanation>.+?)<\/p>/is';

and the XPath adjusted from Tomalak's solution below:

$path = "p/strong[contains(., '$info')]/following-sibling::text()[1]";

No slashes at the beginning means the path is evaluated relative to the context node rather than searched for anywhere in the DOM tree (that is what a leading // does); in my case it only matches where I want it to.
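
The difference between a path that starts with // and one without leading slashes is easiest to see when the context node is passed explicitly. The sketch below is an illustration under assumptions: the HTML fragment is made up, with an extra <strong> outside the <p> added on purpose, and the variable names are hypothetical.

```php
<?php
// Hypothetical fragment modelled on the email body; the <strong> inside the
// <div> exists only to show why scoping the query to the <p> matters.
$html = '<html><body>'
      . '<div><strong>Sender:</strong> someone else</div>'
      . '<p><strong>Sender:</strong> Holly Schoene<br />'
      . '<strong>Email:</strong> email@test.net<br /></p>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Absolute: // searches the whole document, so both <strong> elements match.
$all = $xpath->query("//strong[contains(., 'Sender:')]/following-sibling::text()[1]");

// Relative: select the <p> first, then evaluate a path without leading
// slashes against that node by passing it as the second argument.
$p      = $xpath->query("//p[strong]")->item(0);
$scoped = $xpath->query("strong[contains(., 'Sender:')]/following-sibling::text()[1]", $p);

$allCount    = $all->length;
$scopedValue = trim($scoped->item(0)->textContent);

var_dump($allCount, $scopedValue);
```

Writing the restriction into an absolute path, as in //p/strong[...], is another way to get the same scoping without a separate context node.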

Thanks to everyone who tried to help.

[Tony the Pony, he comes!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) var_dump($decodedMessage); and paste it here. — Matt, Mar 24 '13 at 15:40
DOMDocument is an interesting idea - the thing is I can get the data out a few different ways, but I liked the idea of applying one regex and getting the data in a neat named array without having to traverse the DOM tree - besides, can I always be certain that the email body will be fully traversable HTML? Probably not - so this is now mostly about trying to understand why it does not work this way — Holly, Mar 24 '13 at 16:38
Can you write the contents of $decodedMessage to a text file so that we can compare the contents of that to the value of $string? — Captain Payalytic, Mar 24 '13 at 17:08
Are you sure what you added is the return value of the call to `decodeMessage`, as I tried your regex on it and it worked fine. BTW all the `[\w\W. ]*?` and such-like can be simply `.*?`, and there is no need for all the backslashes as only the `/` need escaping. — MikeM, Mar 24 '13 at 17:22
`can I always be certain that the email body will be fully traversable html?` The same applies for your Regex — dan-lee, Mar 24 '13 at 18:46
@MikeM I know that .*? should work if the s flag at the end is given, but somehow it still did not always work correctly, so I added the \w\W even though they should be redundant. All the escaping was an act of desperation - originally I had only the / escaped - but thanks for confirming that — Holly, Mar 24 '13 at 18:47
What values are being passed to `decodeMessage` as the `$encoding` and `$charset`? — MikeM, Mar 24 '13 at 19:03
@MikeM I first call $overview = imap_fetch_overview($inbox, $email_number, 0); and use the data I get from it - so $encoding is usually 3 or 4 and $charset is 'utf-8' or 'iso-8859-1' - the message is multipart (1.1 - TEXT, 1.2 - HTML) and I process each part separately - using the HTML part seemed easier - not so sure about that now — Holly, Mar 24 '13 at 20:40

Tomalak · Accepted Answer · 2013-03-24T20:39:53.553

0

Just for the sake of it, here is an implementation that avoids regex altogether.

$doc = new DOMDocument();
$doc->loadHTML($decodedMessage);
$xpath = new DOMXPath($doc);

$info = array(
  'sender'         => get_info($xpath, 'Sender:'),
  'customer_id'    => get_info($xpath, 'Customer ID:'),
  'email'          => get_info($xpath, 'Email:'),
  'transaction_id' => get_info($xpath, 'Transaction ID:'),
  'reference'      => get_info($xpath, 'Reference:'),
  'explanation'    => get_info($xpath, 'Explanation:')
);


function get_info($xpath_object, $info) 
{
    $result = null;
    $path   = "//strong[contains(., '$info')]/following-sibling::text()[1]";
    $nodes  = $xpath_object->query($path);

    foreach ($nodes as $node)
    {
        $result = $node->textContent;
        break;
    }

    return $result;
}
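
For readers who want to try this approach end to end, here is a self-contained sketch. Several things in it are assumptions rather than part of the answer: the sample markup standing in for $decodedMessage, the helper name first_text_after_label, the trim() call, the suppression of parser warnings, and the use of a nullable return type (which needs PHP 7.1 or later).

```php
<?php
// Hypothetical stand-in for $decodedMessage, reduced to the relevant <p>.
$decodedMessage = '<html><body><p>'
    . '<strong>Sender:</strong> Holly Schoene<br />'
    . '<strong>Customer ID:</strong> 3853XXXX<br />'
    . '<strong>Email:</strong> email@test.net<br />'
    . '</p></body></html>';

// Collect parser warnings internally instead of emitting them; real email
// HTML is often not valid, and loadHTML() would otherwise raise warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($decodedMessage);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Hypothetical helper: same XPath as get_info(), but it trims the text and
// returns null when the label is not found.
function first_text_after_label(DOMXPath $xpath, string $label): ?string
{
    $path  = "//strong[contains(., '$label')]/following-sibling::text()[1]";
    $nodes = $xpath->query($path);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$info = [
    'sender'      => first_text_after_label($xpath, 'Sender:'),
    'customer_id' => first_text_after_label($xpath, 'Customer ID:'),
    'email'       => first_text_after_label($xpath, 'Email:'),
    'reference'   => first_text_after_label($xpath, 'Reference:'), // absent here
];

var_dump($info);
```

Two caveats apply to this sketch. trim() with its default character list does not remove a non-breaking space, so values that follow an &nbsp; may need extra cleanup. The label is interpolated into the XPath string, which is fine for fixed labels like these but would break on a label containing a single quote.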

edited Mar 24 '13 at 20:39

answered Mar 24 '13 at 19:30

Tomalak


OK, I have to say I like that approach even though it's not an answer to my question (assuming it worked, but I only get "result" as the value in each array field) – Holly Mar 24 '13 at 20:35
Typo. You could have seen it (I used `result` instead of `$result` and you have PHP error reporting off). It's still not said that my code works right-away, it's a nudge into the right direction. The regex-based approach is doomed, and you should not pursue it. – Tomalak Mar 24 '13 at 20:41
ah damn, I looked at the content of $nodes but did not see that - well, PHP is not my language - I'm a Java and C# person and used to strong typing ... - I am doing this to help my bf and to learn something - changed it and now get NULL for all values - error reporting is off since he gave me access to his live forum server for this ... - I guess I need to look into it more, but I can see that this is a good approach – Holly Mar 24 '13 at 20:49
Well, it most likely is null because the XPath is not exactly right for your situation or something is not right with DOMDocument/DOMXpath. Try to read yourself into XPath for a few minutes, it's not *that* hard to get. You can switch on error reporting on a per-page-level, I'd recommend that; see http://stackoverflow.com/questions/845021/. – Tomalak Mar 24 '13 at 21:06
As described above, I found the problem and now both solutions work - since your answer added a new and probably better performing solution, I'm going to accept it as the correct answer even though it's not an actual answer to the original question - which is now pointless anyway - I just did not know what else to do with the question - answer it myself? change the title? – Holly Mar 26 '13 at 20:35
In the end what matters is that you've got your problem solved. It's not required that you solved it the way you planned. So if accepting an alternative solution works for you, it would also probably be valuable to other people who find this. However, if you worked out a different solution yourself, post it as an answer so others might benefit as well. -- What was the problem with my answer, BTW? – Tomalak Mar 26 '13 at 20:56
The double slash at the beginning starts searching at the current node and that did not work at first - although it may have been because I added extra stuff to the message and so turned it into invalid HTML - I did not try again to see whether your solution would now work unchanged – Holly Mar 27 '13 at 09:29
The double slash should start at the document root, not at the current node. Also, since my code passes the complete XPath object for context, there is no place to start except the root node. The problem must have been somewhere else, I suspect the XPath expression was not matching your document structure. – Tomalak Mar 27 '13 at 13:34
