
While answering a question, I found myself struggling not to use a regex to parse HTML.

How can I extract the URL from `style="background:url(http...)"` using only an HTML parser?

<a href="http://goruzont.blogspot.com/2017/04/blog-post_6440.html" style="background:url(https://1.bp.blogspot.com/-6vpIH5iqPYs/WPzlNdxsRpI/AAAAAAAAntU/d7U_Ch_6FiIPwosNL4tWwqBeXw8qwo2nACLcB/s1600/1424051.jpg) no-repeat center center;background-size:cover">

To make it clear, I need:

https://1.bp.blogspot.com/-6vpIH5iqPYs/WPzlNdxsRpI/AAAAAAAAntU/d7U_Ch_6FiIPwosNL4tWwqBeXw8qwo2nACLcB/s1600/1424051.jpg
Pedro Lobito
  • [Check this out.](http://htmlparsing.com/php.html) – Rahul Apr 23 '17 at 18:08
  • What specifically should I check there that answers the question? – Pedro Lobito Apr 23 '17 at 18:09
  • It's titled _How to parse HTML the right way, without regular expressions_. A helpful pointer perhaps. – Rahul Apr 23 '17 at 18:12
  • Don't take it to the point of absurdity :). That phrase applies to searching HTML. Parsing the value of `style` with a regex is perfectly fine – splash58 Apr 23 '17 at 18:12
  • @splash58 Could you please answer the question using only an HTML parser? It's a challenge for me! – Pedro Lobito Apr 23 '17 at 18:13
  • Of course. I can use XPath string functions to crop the string, but I think it will be slow – splash58 Apr 23 '17 at 18:16
  • @Rahul I know ( and can always learn more) *How to parse HTML the right way* , but that article doesn't address the problem in my question. – Pedro Lobito Apr 23 '17 at 18:20
  • https://eval.in/781300 – splash58 Apr 23 '17 at 18:22
  • Is this fake question to push the idea of not parsing html with regex ? Absurd !! Just take off the regex tag, and all references to regex in your question. Html parsers _use_ regex to parse tags. Has nothing to do with nested content. `<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>` –  Apr 23 '17 at 19:51
  • The question is legit. Sure they do, but most authors of DOM parsers know what they're doing, unlike a lot of people... Using an HTML parser built on regex is completely different from using regex to parse HTML. Don't take this too personally ;) – Pedro Lobito Apr 23 '17 at 20:01

1 Answer


Without regex:

// Parse the markup into a DOM tree.
$dom = new DOMDocument;
$dom->loadHTML('
<a href="http://goruzont.blogspot.com/2017/04/blog-post_6440.html" style="background:url(https://1.bp.blogspot.com/-6vpIH5iqPYs/WPzlNdxsRpI/AAAAAAAAntU/d7U_Ch_6FiIPwosNL4tWwqBeXw8qwo2nACLcB/s1600/1424051.jpg) no-repeat center center;background-size:cover">
');
$xpath = new DOMXPath($dom);
// Keep the text between "background:url(" and the first ")" that follows it.
echo $xpath->evaluate('substring-before(substring-after(string(//a/@style), "background:url("), ")")');

Demo
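The XPath expression above depends on the exact text `background:url(` and on the URL being unquoted. The following is a minimal sketch, not part of the original answer, of how the same idea might tolerate a space after the colon or a quoted `url('...')`. The helper name `backgroundUrl` and the example.com URL are hypothetical; it still uses only the DOM extension and string trimming, no regex.

```php
<?php
// Hypothetical helper (not from the original answer): returns the url(...) value
// from the style attribute of the first <a> element, or null if there is none.
function backgroundUrl(string $html): ?string
{
    // Collect libxml warnings internally instead of printing them for imperfect markup.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Same XPath string functions as above, but anchored on "url(" only,
    // so "background: url(" and "background-image:url(" also match.
    $raw = $xpath->evaluate('substring-before(substring-after(string(//a/@style), "url("), ")")');

    // CSS allows url("...") and url('...'); strip surrounding quotes and whitespace.
    $url = trim($raw, " \t\n\r\"'");
    return $url === '' ? null : $url;
}

// prints https://example.com/a.jpg
echo backgroundUrl('<a href="#" style="background: url(\'https://example.com/a.jpg\') no-repeat">x</a>'), "\n";
```

Like the original expression, this only inspects the first `<a>` in the document and stops at the first `)`, so a URL that itself contains a closing parenthesis would be cut short.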

splash58