I can't get the following regex to work in PHP. I am trying to take some horrendous Outlook HTML that contains a numbered list, strip the HTML, then run a regex over the plain text to extract the list.
If I take the text produced by strip_tags() and test it on regex101.com, it finds the ordered list just fine. If I use that same regex with preg_match_all() in PHP, it produces an empty array.
Fiddles and regex101 below:
PHP:
$calendar_code = '
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
a:link, span.MsoHyperlink
{color:#0563C1;
text-decoration:underline}
a:visited, span.MsoHyperlinkFollowed
{color:#954F72;
text-decoration:underline}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{margin-top:0cm;
margin-right:0cm;
margin-bottom:0cm;
margin-left:36.0pt;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
p.msonormal0, li.msonormal0, div.msonormal0
{margin-right:0cm;
margin-left:0cm;
font-size:12.0pt;
font-family:"Times New Roman",serif}
span.EmailStyle19
{font-family:"Calibri",sans-serif;
color:windowtext}
.MsoChpDefault
{font-size:10.0pt;
font-family:"Calibri",sans-serif}
@page WordSection1
{margin:72.0pt 72.0pt 72.0pt 72.0pt}
div.WordSection1
{}
ol
{margin-bottom:0cm}
ul
{margin-bottom:0cm}
-->
</style>
</head>
<body lang="EN-GB" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">This is a test of the agenda and objectives format</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">This shouldn’t get picked up</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Dasdasdasd d asda sd : asd obe: sad neither shood this</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Objective: This is how the object should look, this is a long one</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Agenda:</p>
<p class="MsoListParagraph" style="text-indent:-18.0pt"><span style="">1.<span style="font:7.0pt "Times New Roman"">
</span></span>Make like a tree</p>
<p class="MsoListParagraph" style="text-indent:-18.0pt"><span style="">2.<span style="font:7.0pt "Times New Roman"">
</span></span>And</p>
<p class="MsoListParagraph" style="text-indent:-18.0pt"><span style="">3.<span style="font:7.0pt "Times New Roman"">
</span></span>Get out of here</p>
<p class="MsoNormal"> </p>
<p class="MsoNormal">Some more stuff here, and here and ::: ;s</p>
<p class="MsoNormal">Sadfdsf sdfdfeswrfew </p>
<p class="MsoNormal"> </p>
</div>
</body>
</html>
';
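// Remove all tags, leaving only the text content.
$strip = strip_tags($calendar_code);
echo "<pre>";
echo $strip;
// Expected to capture the numbered agenda items, but the match arrays come back empty.
preg_match_all("/^(\d+\.)\s+([^\r\n]+)(?:[\r\n]*)/m", $strip, $matches);
print_r($matches);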
PHPFiddle: http://phpfiddle.org/main/code/ygut-5jj5
As you can see, I echo out the HTML-stripped text. When I put this text into regex101.com it works perfectly. See here: https://regex101.com/r/wW1kC9/1
I thought it might have something to do with the line endings, but I replaced all the HTML line endings with \n before doing the strip_tags() and it still doesn't work.
Can anyone see why this regex is not working with preg_match_all()?
UPDATE:
It's been pointed out that non-breaking spaces are the reason, so removing them or allowing for them in the regex will fix it. However, it has also been pointed out that the format of these lists will vary depending on the email client that sends them (some use <ol> and some do not, for example), so a regex will not work for every situation, or even the majority of situations.
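For reference, here is a minimal sketch of the "allow for them in the regex" fix. It assumes the non-breaking spaces reach PHP as raw UTF-8 characters (bytes C2 A0); if they were still `&nbsp;` entities, something like html_entity_decode() would be needed first. The sample text and the $nbsp, $pattern and $items names are made up for illustration and are not the real strip_tags() output.

```php
<?php
// Hypothetical sample: roughly what strip_tags() might leave behind for the
// Outlook list, with UTF-8 non-breaking spaces (bytes C2 A0) after each number.
$nbsp = "\xC2\xA0";
$strip = "Agenda:\n"
    . "1.{$nbsp}{$nbsp}{$nbsp}\nMake like a tree\n"
    . "2.{$nbsp}{$nbsp}{$nbsp}\nAnd\n"
    . "3.{$nbsp}{$nbsp}{$nbsp}\nGet out of here\n"
    . "{$nbsp}\nSome more stuff here\n";

// The u modifier makes PCRE treat the pattern and subject as UTF-8, and
// \x{00A0} explicitly adds the non-breaking space to the separator class.
$pattern = '/^(\d+\.)[\s\x{00A0}]+([^\r\n]+)/mu';

preg_match_all($pattern, $strip, $matches, PREG_SET_ORDER);

// Collect just the item text (capture group 2) from each match.
$items = [];
foreach ($matches as $m) {
    $items[] = trim($m[2]);
}

print_r($items);
```

This only addresses the Outlook-style output shown above; it does nothing for clients that emit real <ol> / <li> markup.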
I need a better way of getting the contents of lists created by any number of different email clients.
For some background: people create these lists in emails and send them to a special email account. My code then accesses these emails and retrieves the lists for use elsewhere in my app. As these lists are created in many different email clients, they will invariably have different (effectively random) formatting applied. For example, when you create a list in Outlook 2016, it adds <p> and <span> tags with styling to create the list.