Regex to get html element content by class

Question

I've seen a few questions like this where the accepted answer is to use an HTML parser. BUT if I HAD to use regex (PHP), how could I get the span text in the examples below, based on the class name?

<span class="phone-number" data-id="999" style="{lots of random stuff here}">+61 9900 0000</span>
<span class="email" data-something="xxx" style="{lots of random stuff here}">test@test.com</span>

So my variables would be element type and class name.

With my basic knowledge, I've gotten this far:

(?<=span class="phone-number")\s+(.*?)(?=<\/span>)

but that includes the data and style attributes.
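
The attempt above keeps the attributes because the lookbehind ends right after the class attribute, so `(.*?)` starts matching while still inside the opening tag. One way around that is to consume the rest of the opening tag with `[^>]*>` before the capture group. The following is only a minimal sketch under several assumptions: the class attribute is double-quoted, no attribute value contains `>`, and elements of the same type are not nested. `buildClassPattern` is a hypothetical helper name, not part of the existing application.

```php
<?php
// Hypothetical helper: builds a pattern whose group 1 is the text of a
// <$tag> element that has $class as one of its class names.
function buildClassPattern(string $tag, string $class): string
{
    $t = preg_quote($tag, '~');
    $c = preg_quote($class, '~');

    // <tag ... class="[other ]name[ other]" ...>TEXT</tag>
    return '~<' . $t . '\b[^>]*\sclass="(?:[^"]*\s)?' . $c
         . '(?:\s[^"]*)?"[^>]*>(.*?)</' . $t . '>~is';
}

$html = '<span class="phone-number" data-id="999" style="x">+61 9900 0000</span>'
      . '<span class="email" data-something="xxx" style="x">test@test.com</span>';

preg_match(buildClassPattern('span', 'phone-number'), $html, $m);
$phone = $m[1] ?? null;

preg_match(buildClassPattern('span', 'email'), $html, $m);
$email = $m[1] ?? null;

echo $phone . PHP_EOL;
echo $email . PHP_EOL;
```

Because element type and class name are the only inputs, the generated pattern string could be stored in an existing array of patterns. It will still break on markup that violates the assumptions listed above.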

I don't think regex is the best option here. Why not use a DOM parser? — Felippe Duarte, Jan 15 '18 at 22:37
I just pasted this in another question. So here it is to you too: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Jorge Campos, Jan 15 '18 at 22:38
There's a reason every experienced developer says not to use regex to parse HTML. Trust the experience and use a parser. If you think you can't use a parser, can you explain why not? — jhilgeman, Jan 15 '18 at 22:42
There is no reason not to use a parser, considering DOMDocument is built in and requires no external libs. Regex is the wrong tool for HTML as HTML is too volatile. — Lawrence Cherone, Jan 15 '18 at 22:47
@jhilgeman my reason is this: all other matching mechanisms in this existing application use regex on text versions of HTML emails. These patterns are stored in an array for easy access and are chosen depending on input. The raw HTML is available (as plain text won't work for this particular email) but the program flow must use the regex array or else it must be completely re-written. — Warren, Jan 15 '18 at 22:53
Is there any issue with `(?<=)(.*?)(?=<\/span>)` (https://regex101.com/r/vb8d7X/1/) — Warren, Jan 15 '18 at 22:54
Why not just detect the HTML and add a different flow / processing function for that scenario so the existing text emails continue to work as-is? — jhilgeman, Jan 15 '18 at 22:58
I'm guessing the downvotes are because people don't like the question. There is nothing wrong with the question at all. I know I should use DOM parsing but I don't have that option so the question is about how to use regex in this situation. — Warren, Jan 16 '18 at 02:39

score 2 · Answer 1 · answered Jan 15 '18 at 22:57

Don't use regex to parse HTML; use DOMDocument and DOMXPath instead.

<?php
$html = '
<span class="phone-number" data-id="999" style="{lots of random stuff here}">+61 9900 0000</span>
<span class="email" data-something="xxx" style="{lots of random stuff here}">test@test.com</span>
';

$dom = new DOMDocument;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

$phone = $xpath->query("//span[contains(@class, 'phone-number')]");
$email = $xpath->query("//span[contains(@class, 'email')]");

echo $phone->item(0)->nodeValue.PHP_EOL; //+61 9900 0000
echo $email->item(0)->nodeValue.PHP_EOL; //test@test.com

/*
// Loop when the query matches multiple elements
foreach ($phone as $value) {
    echo $value->nodeValue;
}
*/

https://3v4l.org/qbVaS
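
Note that `contains(@class, 'email')` is a substring test, so it would also match a class such as `email-footer`. If that matters, a common XPath 1.0 idiom pads the class attribute with spaces and looks for the padded name. Here is a minimal sketch; `classTokenQuery` is a hypothetical helper name, and it assumes the tag and class names contain no quote characters.

```php
<?php
// Hypothetical helper: XPath expression that matches $class only as a
// whole, space-separated class name on <$tag> elements.
function classTokenQuery(string $tag, string $class): string
{
    return sprintf(
        "//%s[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $tag,
        $class
    );
}

$html = '<span class="email-footer">ignored</span>'
      . '<span class="contact email">test@test.com</span>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query(classTokenQuery('span', 'email'));

// Only the second span has "email" as a complete class name.
echo $nodes->length . PHP_EOL;
echo $nodes->item(0)->nodeValue . PHP_EOL;
```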

This will be helpful for future endeavors when I have the option to use HTML parsing. — Warren, Jan 16 '18 at 02:37
