Not using regex to parse HTML

Question

While answering a question, I found myself struggling not to use a regex to parse HTML.

How should I get the URL value of style="background:url(http...) using only an HTML parser?

<a href="http://goruzont.blogspot.com/2017/04/blog-post_6440.html" style="background:url(https://1.bp.blogspot.com/-6vpIH5iqPYs/WPzlNdxsRpI/AAAAAAAAntU/d7U_Ch_6FiIPwosNL4tWwqBeXw8qwo2nACLcB/s1600/1424051.jpg) no-repeat center center;background-size:cover">

To make it clear, I need:

https://1.bp.blogspot.com/-6vpIH5iqPYs/WPzlNdxsRpI/AAAAAAAAntU/d7U_Ch_6FiIPwosNL4tWwqBeXw8qwo2nACLcB/s1600/1424051.jpg

What specifically should I check there that answers the question? — Pedro Lobito, Apr 23 '17 at 18:09
It's titled _How to parse HTML the right way, without regular expressions_. A helpful pointer perhaps. — Rahul, Apr 23 '17 at 18:12
Don't take it to the point of absurdity :). That advice applies to searching HTML. Parsing the value of `style` with a regex is perfectly fine — splash58, Apr 23 '17 at 18:12
@splash58 Could you please answer the question using only an HTML parser? It's a challenge for me! — Pedro Lobito, Apr 23 '17 at 18:13
Of course. I can use XPath string functions to crop the string, but I think it will be slow — splash58, Apr 23 '17 at 18:16
@Rahul I know (and can always learn more) *How to parse HTML the right way*, but that article doesn't address the problem in my question. — Pedro Lobito, Apr 23 '17 at 18:20
Is this a fake question to push the idea of not parsing HTML with regex? Absurd!! Just take off the regex tag and all references to regex in your question. HTML parsers _use_ regex to parse tags. It has nothing to do with nested content. `<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>` — , Apr 23 '17 at 19:51
The question is legit. Sure they do, but most DOM parser authors know what they're doing, unlike a lot of people... Using an HTML parser that is built on regex is completely different from using regex to parse HTML. Don't take this too personally ;) — Pedro Lobito, Apr 23 '17 at 20:01

score 3 · Accepted Answer · answered Apr 23 '17 at 18:23

Without regex:

// Load the snippet into a DOM tree.
$dom = new DOMDocument;
$dom->loadHTML('
<a href="http://goruzont.blogspot.com/2017/04/blog-post_6440.html" style="background:url(https://1.bp.blogspot.com/-6vpIH5iqPYs/WPzlNdxsRpI/AAAAAAAAntU/d7U_Ch_6FiIPwosNL4tWwqBeXw8qwo2nACLcB/s1600/1424051.jpg) no-repeat center center;background-size:cover">
');
$xpath = new DOMXPath($dom);
// XPath 1.0 string functions: keep the text between "background:url(" and the next ")".
echo $xpath->evaluate('substring-before(substring-after(string(//a/@style), "background:url("), ")")');

Demo
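
The expression above only matches the exact spelling `background:url(` with an unquoted URL. As a rough sketch (not part of the original answer), the variant below cuts at `url(` instead and strips optional CSS quotes, still without regex. The function name `firstBackgroundUrl` and the example.com URL are made up for illustration; `DOMDocument`, `DOMXPath::evaluate` and `libxml_use_internal_errors` are standard PHP DOM/libxml calls.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// returns the first url(...) value found in the style attribute of the
// first <a> element, or null when there is none.
function firstBackgroundUrl(string $html): ?string
{
    $dom = new DOMDocument;
    // Collect libxml warnings internally instead of printing them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Same XPath 1.0 string functions as above, but cutting at "url(" so
    // that "background: url(" and "background-image:url(" are handled too.
    $raw = $xpath->evaluate(
        'normalize-space(substring-before(substring-after(string(//a/@style), "url("), ")"))'
    );

    // Strip optional CSS quotes and padding: url("...") or url('...').
    $url = trim($raw, "\"' ");
    return $url === '' ? null : $url;
}

// prints https://example.com/a.jpg
echo firstBackgroundUrl('<a href="#" style="background: url(\'https://example.com/a.jpg\') no-repeat">x</a>'), "\n";
```

Like the original, this looks only at the first `<a>` in the document and stops at the first `)`, so a URL that itself contains a closing parenthesis would be cut short.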

answered Apr 23 '17 at 18:23 by splash58

As you said, it's slow but it does answer the question correctly. tks! – Pedro Lobito Apr 23 '17 at 18:24
Don't mention it :) – splash58 Apr 23 '17 at 18:26
