I'm trying to extract six fields (Sender, Customer ID and so on) from the body of an HTML email:
$string = '... some other html text ... <p>
<strong>Sender:</strong> Holly Schöne<br>
<strong>Customer ID:</strong> 3853XXXX<br>
<strong>Email:</strong> email@test.net<br>
<strong>Transaction ID:</strong> 836248467<br>
<strong>Reference:</strong> product<br>
<strong>Explanation:</strong> Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
</p>... some more html text ...';
... which I fetch and decode like this:
$message = imap_fetchbody($inbox, $email_number, $section);
// determine $encoding and $charset
$decodedMessage = decodeMessage($message, $encoding, $charset);
using this function (the cases for the other encodings are left out, since nothing is done there):
function decodeMessage($message, $encoding, $charset) {
    // Undo the content transfer encoding reported for this body part.
    switch ($encoding) {
        case 3: // BASE64
            $message = base64_decode($message);
            break;
        case 4: // QUOTED-PRINTABLE
            $message = quoted_printable_decode($message);
            break;
        default:
            break;
    }
    // Convert to UTF-8 when the part declares a charset.
    if ($charset != NULL) {
        $message = mb_convert_encoding($message, 'utf-8', $charset);
        //$message = mb_convert_encoding($message, 'iso-8859-1', $charset);
    }
    return $message;
}
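The "determine $encoding and $charset" step is not shown above. A minimal sketch of how it could look, assuming the part object has the shape returned by imap_fetchstructure() (an encoding number plus a parameters list of attribute/value pairs); findCharset() is a hypothetical helper name, and the stand-in object below replaces a real IMAP call:

```php
<?php
// Hypothetical helper: looks up the charset declared for one body part.
// $part is assumed to have the shape returned by imap_fetchstructure().
function findCharset(object $part): ?string
{
    if (empty($part->ifparameters)) {
        return null; // the part declares no parameters at all
    }
    foreach ($part->parameters as $param) {
        // Attribute names may arrive in upper or lower case.
        if (strtolower($param->attribute) === 'charset') {
            return $param->value;
        }
    }
    return null;
}

// Stand-in for a single-part message structure (no IMAP connection needed).
$part = (object) [
    'encoding'     => 4, // 4 = QUOTED-PRINTABLE, 3 = BASE64
    'ifparameters' => 1,
    'parameters'   => [(object) ['attribute' => 'CHARSET', 'value' => 'ISO-8859-1']],
];

$encoding = $part->encoding;
$charset  = findCharset($part);
```

For a multipart message the same lookup would have to be applied to the sub-part that matches $section rather than to the top-level structure.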
That all works like a charm. The problem starts here:
$regex = '/\<p\>[\w\W. ]*?\<strong\>Sender\:\<\/strong\> (?<sender>[\w\W ]+?)\<br\>.*?\<strong\>Customer ID\:\<\/strong\> (?<customerId>[\w\W ]+?)\<br\>.*?\<strong\>Email\:\<\/strong\> (?<email>[\w\W ]+?)\<br\>.*?\<strong\>Transaction ID\:\<\/strong\> (?<transactionId>[\w\W ]+?)\<br\>.*?\<strong\>Reference\:\<\/strong\> (?<reference>[\w\W ]+?)\<br\>.*?\<strong\>Explanation\:\<\/strong\> (?<explanation>[\w\W ]+?)\<\/p\>/is';
$result = preg_match($regex, $decodedMessage, $matches);
If I apply that regex to the string above I get exactly what I wanted - an array like this:
print_r($matches) = Array (
[0] => <p>
<strong>Sender:</strong> Holly Schöne<br>
<strong>Customer ID:</strong> 3853XXXX<br>
<strong>Email:</strong> email@test.net<br>
<strong>Transaction ID:</strong> 836248467<br>
<strong>Reference:</strong> product<br>
<strong>Explanation:</strong> Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
</p>
[sender] => Holly Schöne
[1] => Holly Schöne
[customerId] => 3853XXXX
[2] => 3853XXXX
[email] => email@test.net
[3] => email@test.net
[transactionId] => 836248467
[4] => 836248467
[reference] => product
[5] => product
[explanation] => Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
[6] => Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
)
... however if I do the same with the $decodedMessage I get:
preg_last_error() -> PREG_NO_ERROR
$result -> [empty string]
$matches -> array()
I tried everything and looked around but I just can't figure out the problem. My guess is it has to do with the encoding or the character set of the email body. Any help would be greatly appreciated.
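One way to test a guess like this is to look at the raw bytes around a label instead of a rendered or prettified view. The sketch below is only an illustration: showContext() is a hypothetical helper, and the sample string stands in for $decodedMessage.

```php
<?php
// Hypothetical helper: returns the raw text around the first occurrence of
// $needle, with line breaks and other control bytes shown as escapes.
function showContext(string $haystack, string $needle, int $radius = 40): ?string
{
    $pos = strpos($haystack, $needle);
    if ($pos === false) {
        return null; // the label itself is missing from the string
    }
    $start = max(0, $pos - $radius);
    $chunk = substr($haystack, $start, strlen($needle) + 2 * $radius);
    // Escape bytes 0-31 so that \r, \n and \t become visible.
    return addcslashes($chunk, "\0..\37");
}

// Stand-in for $decodedMessage.
$sample = "<p>\r\n<strong>Sender:</strong> Holly<br />\r\n";
echo showContext($sample, 'Sender', 20), PHP_EOL;
```

Printed this way, details such as a self-closing br tag or CRLF line endings show up exactly as they are in the string the regex runs against.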
Well you asked for it - I thought this question was already very long ... here is the vardump - I only changed some personal details
... ah damn ... and there was my problem as well.
I let myself be fooled by Waterfox's source code viewer: it showed <br /> as <br> and added a <tbody> to each table, so the source code I based my regex on was not the one the email actually had. I feel pretty foolish now. The actual HTML source code is below:
<html>
<table width="750" cellpadding="0" cellspacing="0">
<tr>
<td style="background-repeat:no-repeat;" background="http://i1.mbsvr.net/images/bg_mailframe.gif" width="100%" align="center">
<table width="95%" align="center">
<tr>
<td align="left" style="padding:10px 0 0 10px;">
<a href="http://www.moneybookers.com/app/?l=EN" target="_blank" style="color:FD932C;font-weight:normal;" onfocus="this.blur()">
<img src="http://i1.mbsvr.net/images/skrill/mb-logo-the-future.png" border="0" />
</a>
</td>
</tr>
</table>
<table width="740">
<tr><td style="padding:0px 40px 0px 0px" align="center">
<table width="100%" border="0" cellpadding="0" cellspacing="0">
<tr>
<td valign="top" align="middle">
<table cellspacing="0" cellpadding="0" width="100%" border="0">
<tr>
<td>
<hr style="!important; font-family: verdana, arial, sans-serif; border: 0; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;" />
</td>
</tr>
<tr>
<td style="!important; font-family: verdana, arial, sans-serif; margin: 0; padding: 0px 0px 10px 0px; color: #EF8116; font-weight: bold; font-size: 18px;" nowrap width="50%">
You have received EUR 0.05
</td>
</tr>
<tr>
<td style="!important; font-family: verdana, arial, sans-serif; font-size: 11px; color: #656565;">
<br/>
Dear Mmmmmmm Bbbbbbb,<br />
<br/>
Holly Schöne has sent you EUR 0.05 via Skrill (Moneybookers). The full details of the transaction are:<br />
<p>
<strong>Sender:</strong> Holly Schöne<br />
<strong>Customer ID:</strong> 3853XXXX<br />
<strong>Email:</strong> email@test.net<br />
<strong>Transaction ID:</strong> 836151721<br />
<strong>Reference:</strong> TPBwishes<br />
<strong>Explanation:</strong> Holly Schoene
#gsg4sda65g4r65e4g8s4g56asd54e#
</p>
Your money is waiting for you in your Skrill (Moneybookers) account - <a href="https://www.moneybookers.com">https://www.moneybookers.com</a>.<br />
<br />
<b>IMPORTANT:</b> If you are using Skrill (Moneybookers) commercially, we <b>STRONGLY</b> advise that you check in your Skrill (Moneybookers) account history that the money is there.<br />
<br />
Have you increased your withdrawal and receiving limits? Just log into your Skrill (Moneybookers) account and click <b>View Limits</b> in the "My Account" section.<br />
<br />
Kind regards,<br />
Skrill (Moneybookers)<br />
</td>
</tr>
<tr>
<td>
<hr style="!important; font-family: verdana, arial, sans-serif; border: 0; margin: 8px 0px 0px 0px; padding: 6px 0px 0px 0px; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;" />
</td>
</tr>
</table>
<table cellspacing=0 cellpadding=0 width="100%" border=0>
<tr>
<td style="font-family: verdana, arial, sans-serif; font-size: 12px; color: #656565;"><b>Skrill (Moneybookers) Security Reminders</b></td>
</tr>
<tr>
<td class=smooth valign="top" style="font-family: verdana, arial, sans-serif; font-size: 11px; color: #656565;"><p> <br> <strong>Protect Your Password</strong><br>Skrill (Moneybookers) and its representatives will NEVER ask you to reveal your password. There are NO EXCEPTIONS to this policy. If anyone asks for your password by phone or by email, or on any website other than moneybookers.com, refuse and immediately report this to <a href="mailto:security@moneybookers.com" style="color: #862165; text-decoration: none; outline: none !important; font-weight: bold;">security@moneybookers.com</a>.<br><br><strong>Access your account ONLY using the login link on the Moneybookers homepage</strong><br>Please be advised that Skrill (Moneybookers) and its representatives will NEVER send you an email asking you to provide your login details within a form provided or to click on a hyperlink to access your account! Immediately report any incident to <a href="mailto:security@moneybookers.com" style="color: #862165; text-decoration: none; outline: none !important; font-weight: bold;">security@moneybookers.com</a>.<br><br><strong>Case Sensitive Login</strong><br>Please remember your password is case-sensitive, at least 8 characters long and contains at least one number or non-alphabetic character such as '-'. <br> <br> </p></td> </tr> </table>
</td>
</tr>
</table> </td></tr>
<tr>
<td style="padding:0px 54px 0px 0px" class="separator"><hr style="border: 0; margin: 8px 0px 0px 0px; padding: 6px 0px 0px 0px; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;"/></td>
</tr>
</table>
<table align="left" width="740">
<tr>
<td width="10"> </td>
<td style="font-family: verdana, arial, sans-serif; font-size: 11px; color: #656565;" valign="top" width="100%" align="center">
Moneybookers Ltd., London, Registered in England and Wales no 4260907.<br>
Registered office: Welken House, 10-11 Charterhouse Square, London, EC1M 6EH, United Kingdom.<br>
Authorised by the Financial Services Authority (FSA) under the Electronic Money Regulations 2011 for the issuing of electronic money.
</td>
</tr>
</table>
</td>
<tr>
<td valign="top">
<img src="http://i1.mbsvr.net/images/bg_mailframe_bottom.gif" border="0" />
</td>
</tr>
</table>
</html>
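The mismatch can be reproduced in isolation. The sketch below uses a shortened, single-field pattern (an assumption made for brevity, not the full regex) to show that a literal <br> fails against the raw mail source, while a pattern that tolerates optional whitespace and a slash matches both spellings:

```php
<?php
// Two spellings of the same line: as a browser source view may show it,
// and as it appears in the raw mail body.
$asViewed = '<strong>Sender:</strong> Holly Schöne<br>';
$asSent   = '<strong>Sender:</strong> Holly Schöne<br />';

// Strict pattern: only a literal <br> ends the value.
$strict = '/<strong>Sender:<\/strong>\s*(?<sender>.+?)<br>/is';
// Tolerant pattern: <br>, <br/> and <br /> all end the value.
$tolerant = '/<strong>Sender:<\/strong>\s*(?<sender>.+?)<br\s*\/?>/is';

$strictOnSent     = preg_match($strict, $asSent);             // no match
$tolerantOnViewed = preg_match($tolerant, $asViewed);         // match
$tolerantOnSent   = preg_match($tolerant, $asSent, $matches); // match

var_dump($strictOnSent, $tolerantOnViewed, $tolerantOnSent, $matches['sender']);
```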
So together with Tomalak's answer I now have two working solutions:
My now-working regex, which accounts for correctly closed <br /> tags and now also parses the value:
$regex = '/<td .*?>.*?You have received(?<value>.+?\d+\.\d\d).*?<\/td>.*?<p>.*?<strong>Sender:<\/strong>(?<sender>.+?)<br\s*\/?>.*?<strong>Customer ID:<\/strong>(?<customerId>.+?)<br\s*\/?>.*?<strong>Email:<\/strong>(?<email>.+?)<br\s*\/?>.*?<strong>Transaction ID:<\/strong>(?<transactionId>.+?)<br\s*\/?>.*?<strong>Reference:<\/strong>(?<reference>.+?)<br\s*\/?>.*?<strong>Explanation:<\/strong>(?<explanation>.+?)<\/p>/is';
and the XPath adjusted from Tomalak's solution below:
$path = "p/strong[contains(., '$info')]/following-sibling::text()[1]";
There are no slashes at the beginning, so the path is relative to the context node it is evaluated from (a leading // is what searches the whole DOM tree); it only matches where I want it to.
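For reference, a self-contained sketch of how such an XPath lookup can be run with PHP's DOMDocument and DOMXPath. This is not Tomalak's code: extractFields() is a hypothetical helper, the query uses a leading // so that it works without a context node, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper: returns the text that follows each <strong> label
// inside a <p>, keyed by label (null when a label is not found).
function extractFields(string $html, array $labels): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // mail HTML is rarely valid
    // Assumption: $html is UTF-8; the prefix hints that to the HTML parser.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    $fields = [];
    foreach ($labels as $label) {
        // The leading // searches the whole document for a matching <p>.
        $nodes = $xpath->query(
            "//p/strong[contains(., '$label')]/following-sibling::text()[1]"
        );
        $fields[$label] = $nodes->length > 0
            ? trim($nodes->item(0)->nodeValue)
            : null;
    }
    return $fields;
}

// Shortened stand-in for the decoded mail body.
$html = "<p>\n<strong>Customer ID:</strong> 3853XXXX<br />\n"
      . "<strong>Email:</strong> email@test.net<br />\n</p>";
$fields = extractFields($html, ['Customer ID', 'Email', 'Reference']);
print_r($fields);
```

Because the labels are interpolated into the query, this only suits a fixed, trusted label list such as the six field names here.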
Thanks to everyone who tried to help.