I can't get the following Regex to work in PHP. Basically I am trying to take some horrendous Outlook HTML that contains a numbered list, remove the HTML, then Regex the plain text to get the list.

If I take the text that is produced by strip_tags() and test it on regex101.com, it finds the ordered list just fine. If I use that same regex in preg_match_all in PHP it produces an empty array.

Fiddles and regex101 below:

PHP:

$calendar_code = '
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style>
<!--
@font-face
{font-family:"Cambria Math"}
@font-face
{font-family:Calibri}
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
a:link, span.MsoHyperlink
{color:#0563C1;
text-decoration:underline}
a:visited, span.MsoHyperlinkFollowed
{color:#954F72;
text-decoration:underline}
p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{margin-top:0cm;
margin-right:0cm;
margin-bottom:0cm;
margin-left:36.0pt;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif}
p.msonormal0, li.msonormal0, div.msonormal0
{margin-right:0cm;
margin-left:0cm;
font-size:12.0pt;
font-family:"Times New Roman",serif}
span.EmailStyle19
{font-family:"Calibri",sans-serif;
color:windowtext}
.MsoChpDefault
{font-size:10.0pt;
font-family:"Calibri",sans-serif}
@page WordSection1
{margin:72.0pt 72.0pt 72.0pt 72.0pt}
div.WordSection1
{}
ol
{margin-bottom:0cm}
ul
{margin-bottom:0cm}
-->
</style>
</head>
<body lang="EN-GB" link="#0563C1" vlink="#954F72">
<div class="WordSection1">
<p class="MsoNormal">This is a test of the agenda and objectives format</p>
<p class="MsoNormal">&nbsp;</p>
<p class="MsoNormal">This shouldn’t get picked up</p>
<p class="MsoNormal">&nbsp;</p>
<p class="MsoNormal">Dasdasdasd d asda sd&nbsp; : asd obe: sad neither shood this</p>
<p class="MsoNormal">&nbsp;</p>
<p class="MsoNormal">Objective: This is how the object should look, this is a long one</p>
<p class="MsoNormal">&nbsp;</p>
<p class="MsoNormal">Agenda:</p>
<p class="MsoListParagraph" style="text-indent:-18.0pt"><span style="">1.<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span>Make like a tree</p>
<p class="MsoListParagraph" style="text-indent:-18.0pt"><span style="">2.<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span>And</p>
<p class="MsoListParagraph" style="text-indent:-18.0pt"><span style="">3.<span style="font:7.0pt &quot;Times New Roman&quot;">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</span></span>Get out of here</p>
<p class="MsoNormal">&nbsp;</p>
<p class="MsoNormal">Some more stuff here, and here and ::: ;s</p>
<p class="MsoNormal">Sadfdsf sdfdfeswrfew </p>
<p class="MsoNormal">&nbsp;</p>
</div>
</body>
</html>
';

$strip = strip_tags($calendar_code);

echo "<pre>";
echo $strip;

preg_match_all("/^(\d+\.)\s+([^\r\n]+)(?:[\r\n]*)/m", $strip, $matches);
print_r($matches);

PHPFiddle: http://phpfiddle.org/main/code/ygut-5jj5

As you can see I echo out the HTML stripped text. When I put this text in to regex101.com it works perfectly. See here: https://regex101.com/r/wW1kC9/1

I thought it might have something to do with the line endings, but I replaced all the HTML line endings with \n before doing the strip_tags() and it still doesn't work.

Can anyone see why this regex is not working with preg_match_all()?

UPDATE:

It's been pointed out that non-breaking spaces are the reason, so removing or allowing for them in the regex will fix it. However it has also been pointed out that as the format of these lists will be quite random depending on the email client that sends the list, some using <ol> and some not for example, regex will not work for every situation, or even the majority of situations.

I need a better way of getting the contents of lists created by any number of different email clients.

For some background, people create these lists in emails and send them to a special email account. My code then accesses these emails and retrieves the lists for use elsewhere in my app. As these lists are created in many different email clients, they will invariably have different (effectively random) formatting applied. For example, when you create a list in Outlook 2016, it adds <p> and <span> tags with styling to create the list.

superphonic
  • `\s` won't catch `&nbsp;`, and you have a line break after all the `&nbsp;` in each list item that you don't cover in your regex. – hsan Aug 09 '16 at 16:19
  • Use `echo htmlentities($strip);` to see what's really in the variable. – Barmar Aug 09 '16 at 16:21
  • Change `\s+` to `(?:\s|&nbsp;)+` – Barmar Aug 09 '16 at 16:22
  • Yep!, can't believe I didn't notice the non-breaking spaces. I'll remove them before doing the `preg_match_all()` as they may or may not be there all the time. – superphonic Aug 09 '16 at 16:26
  • It is much more reliable to do HTML parsing with DOMDocument than with regular expressions. – trincot Aug 09 '16 at 16:58
  • @hsan line-breaks are matched with `\s` – revo Aug 09 '16 at 16:58
  • @trincot actually he is not dealing with HTML part. – revo Aug 09 '16 at 16:59
  • @revo, he is, as he is extracting text from a enumerated list wrapped in `p` and `span` tags. – trincot Aug 09 '16 at 17:08
  • @trincot Again, this problem doesn't have anything to do with HTML, specifically *parsing* which you mean it, there is no parsing involved so is simply done as the way OP goes. – revo Aug 09 '16 at 17:15
  • @revo, maybe I am indeed missing the point, but up until now I see that the part that is being extracted by [his reference to reg101.com](https://regex101.com/r/wW1kC9/1) is indeed the part that is in the HTML block of his input. I have provided an answer, which illustrates that using DOMDocument gives the output the OP seems to be looking for. Even the OP speaks of *HTML stripped text*... meaning the original is HTML. – trincot Aug 09 '16 at 17:33
  • DOM is a way but OP's problem is not tied to HTML. Definition of a proper tool, to my thinking, is changeable on problem context not just by seeing a part of code which immediately brings a solution: *HTML? DOM!* @trincot – revo Aug 09 '16 at 17:43
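
The debugging advice in the comments above (look at what is really in the variable) can be sketched as follows. This is a minimal illustration on a hypothetical one-item sample modelled on the question's markup, not the original fiddle:

```php
<?php
// Hypothetical one-item sample modelled on the Outlook markup in the question.
$html = '<p><span>1.<span>&nbsp;&nbsp;&nbsp;' . "\n" . '</span></span>Make like a tree</p>';

$strip = strip_tags($html);

// strip_tags() removes tags only; the entity text "&nbsp;" is still in the string.
var_dump(strpos($strip, '&nbsp;') !== false);    // bool(true)

// Escape the string so a browser shows the literal entity instead of rendering it.
echo htmlentities($strip), "\n";

// \s matches spaces, tabs and line breaks, but not the six characters "&nbsp;".
var_dump(preg_match('/^\d+\.\s+\S/m', $strip));  // int(0)
```

Pasting the rendered browser output into regex101 turns each entity into whitespace, which would explain why the pattern appeared to work there.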

3 Answers

1

You have to decode HTML entities:

$strip = html_entity_decode(strip_tags($calendar_code));

There is one more catch: after decoding, each non-breaking space becomes the character U+00A0 (the bytes 0xC2 0xA0 in UTF-8), which the \s token does not match here. You therefore have to allow for that code point explicitly and add the u modifier so the pattern is treated as UTF-8:

preg_match_all("/^(\d+\.)[\s\x{00a0}]+([^\r\n]+)(?:[\r\n]*)/mu", $strip, $matches);

Live demo
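
As a self-contained check of the two steps above, here is a minimal sketch. The two-item sample string is hypothetical (shaped like the question's markup), and passing the flags and `'UTF-8'` explicitly to `html_entity_decode()` is an added assumption so the result does not depend on the PHP version's defaults:

```php
<?php
// Hypothetical two-item sample in the shape of the question's Outlook markup.
$html = "<p>Agenda:</p>\n"
      . "<p><span>1.<span>&nbsp;&nbsp;\n</span></span>Make like a tree</p>\n"
      . "<p><span>2.<span>&nbsp;&nbsp;\n</span></span>And</p>\n";

// Strip the tags, then decode entities with an explicit charset.
$strip = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

// &nbsp; is now U+00A0, stored as the two bytes C2 A0 in UTF-8.
var_dump(strpos($strip, "\xC2\xA0") !== false);

// Same pattern as in the answer; the character class also accepts U+00A0.
preg_match_all('/^(\d+\.)[\s\x{00a0}]+([^\r\n]+)(?:[\r\n]*)/mu', $strip, $matches);
print_r($matches[2]);
```

Group 1 holds the numbers with their dots and group 2 the item texts, so `$matches[2]` is the list to use.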

revo
  • Note that if in *$calendar_code* the `p` tags would not have a newline between them (so no characters would exist between `</p>` and the next `<p>`), it would still represent the same output, but the regex would return an undesirable result. Similarly, if one of the text contents had an embedded newline instead of a space, it would not represent a different output (since there is no CSS present that indicates some preservation of white space), yet the regex would return an undesirable output. These are just two examples of what can go wrong using this way of dealing with the text extraction. – trincot Aug 09 '16 at 19:51
  • A Regular Expression does really mean it: *regular*. To behave *regular*, input can't be *irregular*. It is not about `p` tags only but it could be a minified version of current HTML format which means no new-line characters at all. You are not talking about *regulars* so any *undesirable result* is likely to happen. Poster deals with plain text and is supposed to know how RegExes are going to work. You can bother to imagine about all possible falling cases but simply they are not going to happen while our input is considered *regular* otherwise we have no idea about what we're doing @trincot – revo Aug 09 '16 at 20:15
  • That is why working code starts failing after a while: it is when too much is assumed about what is regular. I refer to [the famous answer](http://stackoverflow.com/a/1732454/5459839) on this. – trincot Aug 09 '16 at 20:23
  • I should add that the word *regular* in *regular expression* has nothing to do with the level of regularity of the input. The term comes from *regular grammar* which is [a specific formal grammar](https://en.wikipedia.org/wiki/Regular_grammar). So it says something about the language, not about the input. Secondly, HTML strings that have newlines instead of blanks or vice versa are not more (ir)regular than others. – trincot Aug 09 '16 at 20:34
  • I smell rationalizing. It's clear a Regular Expression needs a known input to keep to work and by *regular* it is enough for me that you got what I meant. Finally, You don't need to repeatedly refer to a topic which I have - not less than you - referred to. @trincot – revo Aug 09 '16 at 20:36
1

It works with this "/^(\d+\.)(?:&nbsp;|\s)+([^\r\n]+)(?:[\r\n]*)/m"

Apparently the entities are not being removed.

You could remove the entities after strip_tags() with this regex:

(?i)[%&](?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)));

I would just remove them; decoding them might produce unwanted (or undecoded) characters.
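
A minimal sketch of that removal step, assuming the pattern is wrapped in `/.../i` delimiters for `preg_replace()` (the inline `(?i)` becomes the `i` modifier). The sample string and the variable names are hypothetical:

```php
<?php
// Hypothetical sample; the first item contains a genuine "&amp;" entity.
$html = "<p>1.&nbsp;&nbsp;\nTea &amp; biscuits</p>\n<p>2.&nbsp;\nAnd</p>\n";

$strip = strip_tags($html);

// The answer's pattern, wrapped in delimiters for preg_replace().
$entityPattern = '/[%&](?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)));/i';
$clean = preg_replace($entityPattern, '', $strip);

// With the entities gone, the question's original \s+ is enough.
preg_match_all('/^(\d+\.)\s+([^\r\n]+)/m', $clean, $matches);
print_r($matches[2]);
```

One trade-off to be aware of: removal also discards entities that stand for real content, so the `&amp;` in the first item disappears along with the non-breaking spaces.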

1

Here is an alternative solution that does not use strip_tags nor regular expressions to parse HTML (only to parse plain text), but uses the DOM API instead. This is much more reliable:

function unicodeTrim($str) {
    return preg_replace('/^[\pZ\pC]+|[\pZ\pC]+$/u', '', $str);
}

$doc = new DOMDocument();
$doc->loadHTML($calendar_code);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//p[@class="MsoListParagraph"]');
$result = []; // initialise so print_r() works even when no paragraph matches
foreach($nodes as $p) {
    // Use the number as array index, and the part after the dot as its value
    $result[intval($p->nodeValue)] = unicodeTrim(explode(".", $p->nodeValue, 2)[1]);
}
print_r($result);

Output when applied to the sample data:

Array
(
    [1] => Make like a tree
    [2] => And
    [3] => Get out of here
)

See it run on eval.in.

trincot
  • Thanks for this, unfortunately parsing the DOM is not going to be possible as I cannot guarantee the HTML format. The HTML is created by any number of different email clients, outlook, gmail, apple mail, yahoo etc... They each format numbered list differently once the email is sent, even different versions of outlook format it differently, some use ordered lists, some use this stupid `span/paragraph` version I posted here. The only way to catch most formats is to strip the HTML and then regex it as best as possible... – superphonic Aug 10 '16 at 07:57
  • That indeed changes the question, where you wrote about *"some horrendous Outlook HTML that contains a numbered list"*. If however you expect to have *ordered lists* (i.e. using the `<ol>` tag), then you won't even get the numbers in the text. Those numbers would be generated by HTML rendering only. So, I wonder how `strip_tags` could help you with that. In my opinion, this is one more reason not to use that function. – trincot Aug 10 '16 at 08:10
  • That's a good point about ordered lists not having the number. This just got a lot harder. I guess then I need something that will 1) Parse the HTML and change any `<ol>` to actual numbers in front of the list items. 2) Remove the HTML and entities. 3) Rearrange the text so that all groups of characters are on one line separated by a single space. 4) Regex that line to get the list ?? Unless you can think of a better way? I have control over how the user formats the list, maybe I make them use (1) test. (2) text. (3) text etc.. instead, then use regex for that? – superphonic Aug 10 '16 at 08:16
  • Could you tell a bit more about where the users get their template from? Does your application provide it? – trincot Aug 10 '16 at 08:22
  • You write that you *have control over how the user formats the list*. When I read the paragraphs you added to the question, I wonder how you have control over it. Can you explain? If you have some kind of control, then I would look for a solution there. – trincot Aug 10 '16 at 08:35
  • I have control in as much as I can tell them how to create the list. I could say that the list needs to start with the word **LIST:**, and that each number before the list item has to have a hyphen in front of it etc..(-1.). Doing this I guess would stop the email client auto formatting it as a list(in whatever way it does that), but not sure it helps with me getting the list on my end... – superphonic Aug 10 '16 at 08:37
  • True, you could instruct users to use specific delimiters. But even then, some email editors will change a hyphen to a list which generates the hyphen through CSS or other `li` attribute, again leaving out the hyphen from the pure text content. Is it really necessary to use email for this? You could deal with this very easily via a web form that the user needs to submit. Email is not really the best channel for structured information. – trincot Aug 10 '16 at 12:18
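
The steps floated in the comments above (read real `<ol>` lists from the DOM, and fall back to paragraphs that carry their own "N." prefix) could be sketched roughly as follows. The function names `extractNumberedItems()` and `normaliseText()` are hypothetical, the sample markup is invented, and character-set detection is left out, so treat this as a starting point under those assumptions rather than a solution for every email client:

```php
<?php
// Hypothetical helper: collect items from <ol> lists and from paragraphs
// whose text starts with "N." (the Outlook style shown in the question).
function extractNumberedItems(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate messy email markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $items = [];

    // Case 1: real ordered lists; their numbers exist only when rendered.
    foreach ($xpath->query('//ol/li') as $li) {
        $items[] = normaliseText($li->textContent);
    }

    // Case 2: paragraphs that carry the number in their own text.
    foreach ($xpath->query('//p') as $p) {
        $text = normaliseText($p->textContent);
        if (preg_match('/^\d+\.\s*(.+)$/u', $text, $m)) {
            $items[] = $m[1];
        }
    }
    return $items;
}

// Collapse each run of whitespace (including U+00A0) into one space, then trim.
function normaliseText(string $text): string
{
    return trim(preg_replace('/[\s\x{00a0}]+/u', ' ', $text));
}

// Invented sample mixing both list styles.
$sample = '<html><body><ol><li>First</li><li>Second&nbsp;item</li></ol>'
        . '<p><span>1.<span>&nbsp;&nbsp;' . "\n" . '</span></span>Make like a tree</p>'
        . '<p>Not a list</p></body></html>';

print_r(extractNumberedItems($sample));
```

Known gaps in this sketch: items from `<ol>` lists are returned before paragraph items regardless of document order, a `<p>` nested inside an `<li>` could be counted twice, and any ordinary paragraph that happens to begin with a number and a dot is treated as a list item.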