Extract html attributes from string in PHP

Question

I have a variable that looks like this:

$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';

and I want to extract the data-tpl-* attributes so that I end up with an array that looks like this:

$array = array(
    'classname' => 'class',
    'title' => 'innerHTML'
);

The number of "data-tpl-" attributes varies, and it's not always an <li> element. Other than that, it always follows the same format: data-tpl-attributename="attributePlacement".

How can I retrieve those attributes and store them in an array without using regex? I say without regex because everywhere I look, parsing HTML with regex is described as an evil practice. Or is it OK in this case?

Look into the [DOMDocument](http://www.php.net/manual/en/class.domdocument.php) class. — PeeHaa, Apr 24 '14 at 07:53
If it's always in that form, have you tried an HTML or XML parser? — Adrian Wragg, Apr 24 '14 at 07:54
@AdrianWragg No I have not. Not familiar with it. Will have a look. — Weblurk, Apr 24 '14 at 07:56

score 9 · Answer 1 · edited Apr 24 '14 at 08:27

You can use the DOMDocument class for this; there is no need for regular expressions. This is just a starting point that you can build on.

<?php
$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
echo "<pre>";

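// Returns all attributes (name => value) found on elements named $tg.
// If several matching elements exist, their attributes are merged into one array.
function parseTag($content, $tg)
{
    $dom = new DOMDocument;
    $dom->loadHTML($content);
    $attr = array();
    foreach ($dom->getElementsByTagName($tg) as $tag) {
        foreach ($tag->attributes as $attribName => $attribNode) {
            // Keys keep the full attribute name, including the "data-tpl-" prefix.
            $attr[$attribName] = $attribNode->nodeValue;
        }
    }
    return $attr;
}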

$attrib_arr = parseTag($var,'li');
print_r($attrib_arr);

Output:

Array
(
    [data-tpl-classname] => class
    [data-tpl-title] => innerHTML
)
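
The output above keeps the full attribute names and needs the tag name up front, whereas the question asks for keys without the "data-tpl-" prefix on an element of unknown type. A minimal sketch of one way to adapt the same DOMDocument approach follows. The function name extractTplAttributes and its $prefix parameter are hypothetical, not part of the original answer, and it assumes the input is a small HTML fragment like the one in the question.

```php
<?php
// Hypothetical helper (not from the answer above): collect every attribute
// whose name starts with $prefix, from any element, with the prefix stripped.
function extractTplAttributes($html, $prefix = 'data-tpl-')
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = array();
    // '*' matches every element, so the tag name does not need to be known.
    foreach ($dom->getElementsByTagName('*') as $element) {
        foreach ($element->attributes as $attribute) {
            $name = $attribute->nodeName;
            if (strpos($name, $prefix) === 0) {
                $result[substr($name, strlen($prefix))] = $attribute->nodeValue;
            }
        }
    }
    return $result;
}

$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
print_r(extractTplAttributes($var));
```

For the string from the question this should print an array with the keys classname and title. Attributes from several elements end up in the same array, so a later element overwrites an earlier one that uses the same name.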


Balázs Varga · Accepted Answer · 2014-04-24T10:30:15.867

You can extract the values by using some string functions. It looks like this:

$test1 = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
$test2 = '<div data-tpl-anything="something" data-tpl-title="this is a title" data-tpl-third="asdasd"></div>';

var_dump(extract_tpl($test1));
var_dump(extract_tpl($test2));

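function extract_tpl($string, $prefix = "data-tpl-") {
    $array = array();  // returned as-is when nothing matches
    $offset = 0;

    // Compare with false explicitly: strpos() returns 0 for a match at offset 0.
    while (($pos = strpos($string, $prefix, $offset)) !== false)
    {
        $start = $pos + strlen($prefix);            // first character of the attribute name
        $end = strpos($string, '"', $start) - 1;    // the "=" just before the opening quote
        $end2 = strpos($string, '"', $end + 2);     // the closing quote
        $array[substr($string, $start, $end - $start)] = substr($string, $end + 2, $end2 - $end - 2);
        $offset = $end2;                            // continue after this value
    }

    return $array;
}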

Output:

array (size=2)
  'classname' => string 'class' (length=5)
  'title' => string 'innerHTML' (length=9)

array (size=3)
  'anything' => string 'something' (length=9)
  'title' => string 'this is a title' (length=15)
  'third' => string 'asdasd' (length=6)

The offsets in the code (-1, +2, ...) are there to skip over symbols such as = and ".
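
To make those offsets concrete, here is a small trace of the first attribute in the first test string. It is only an illustration of the arithmetic, using the same variable names as the function; the positions in the comments are for this exact string.

```php
<?php
$string = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
$prefix = 'data-tpl-';

$start = strpos($string, $prefix) + strlen($prefix); // 13: first character of "classname"
$end   = strpos($string, '"', $start) - 1;           // 22: the "=" (one before the opening quote)
$end2  = strpos($string, '"', $end + 2);             // 29: the closing quote

$name  = substr($string, $start, $end - $start);       // "classname"
$value = substr($string, $end + 2, $end2 - $end - 2);  // "class"

printf("%d %d %d %s %s\n", $start, $end, $end2, $name, $value);
```

So -1 steps back from the opening quote to the = sign, and +2 steps forward over both the = sign and the opening quote to reach the first character of the value.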

Tronix117 · Answer 3 · 2014-04-24T08:34:15.947

It's not entirely evil. Of course, it may be slow on big strings or with a really complex regexp, but that is not your case. It is also arguably more readable, and easier and quicker to implement, than an HTML or XML parser, which is not more optimized than a simple regexp match.

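$var = '<li data-tpl-classname="class" data-tpl-title="innerHTML"></li>';
// PREG_SET_ORDER groups the results per match: [full match, name, value]
preg_match_all('/data-tpl-([^"]*)="([^"]*)"/i', $var, $matches, PREG_SET_ORDER);

$array = array();
foreach ($matches as $match) {
  // $match[1] is the attribute name without the prefix, $match[2] is its value
  $array[$match[1]] = $match[2];
}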

I used [^"]* instead of .*? since it is a bit quicker.

Note: I just ran a benchmark. Compared to the first answer using DOMDocument, this regexp code is 4 times faster, but it is less clean, since parsing the DOM with a regexp may lead to misinterpretations of the markup. It is slightly slower than the answer using string functions (but easier to read and to maintain).

Note 2: Of course, use this solution only if you are sure of the input format and there will never be any ambiguity; otherwise the DOMDocument solution is cleaner.
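
As an illustration of the kind of misinterpretation meant here, the sketch below runs a regexp and DOMDocument over the same markup. The input string is made up for this example: it contains one real attribute and one attribute-like piece of text inside an HTML comment.

```php
<?php
// Made-up input: the second "attribute" only exists inside a comment.
$html = '<li data-tpl-title="real"><!-- data-tpl-old="stale" --></li>';

// Regexp: matches the pattern anywhere in the string, comment included.
preg_match_all('/data-tpl-([^"]*)="([^"]*)"/i', $html, $matches, PREG_SET_ORDER);
$fromRegex = array();
foreach ($matches as $match) {
    $fromRegex[$match[1]] = $match[2];
}

// DOMDocument: only real attributes of the <li> element are reported.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$fromDom = array();
foreach ($dom->getElementsByTagName('li') as $li) {
    foreach ($li->attributes as $attribute) {
        $fromDom[$attribute->nodeName] = $attribute->nodeValue;
    }
}

print_r($fromRegex);
print_r($fromDom);
```

The regexp result contains both title and old, while the DOMDocument result contains only data-tpl-title. If the input can never contain such text, the two approaches agree.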

Why regular expressions should be used wisely or avoided when parsing HTML:

http://blog.codinghorror.com/parsing-html-the-cthulhu-way

Use them with that in mind:

It's generally a bad idea.

Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it.

I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario.

_compared to the first answer using DOMDocument, this code using Regexp is 4 time faster, I just made a benchmark._ Are you serious ? Please don't use regex for parsing HTML , You will get a downvote rain on your answer soon. — Shankar Narayana Damodaran, Apr 24 '14 at 08:17
You need to know I'm only speaking about which one run fast (in this case the answer using only str functions is slightly faster but harder to read and maintain). I never said it was cleaner, but I don't consider it evil, it all depends about use cases. — Tronix117, Apr 24 '14 at 08:26
http://blog.codinghorror.com/parsing-html-the-cthulhu-way/ , I am done with. — Shankar Narayana Damodaran, Apr 24 '14 at 08:27
I agree too, and these are the important rules: * It's generally a bad idea. // * Unless you have discipline and put very strict conditions on what you're doing, matching HTML with regular expressions rapidly devolves into madness, just how Cthulhu likes it. // * I had what I thought to be good, rational, (semi) defensible reasons for choosing regular expressions in this specific scenario. — Tronix117, Apr 24 '14 at 08:31
