I need to capture specific tags from an HTML page using PHP.

A single HTML document can contain multiple matches, and a match may span multiple lines. I ONLY need to match tags that include a data-uid value. For each match I need to capture:

  • Tag name (div, span, etc.)
  • The data-uid value
  • Child nodes

So far, I have been able to capture the tag name and the data-uid value, but not the child nodes.

<div class="testClassOne" data-uid="123456">
    <div class="testClassTwo">Content</div>
    <!-- More nodes -->
</div>

Result: { tag: "div", data-uid: 123456, children: '<div class="testClassTwo">Content</div>' }

or

<div class="testClassOne" data-uid="123456"></div>

Result: { tag: "div", data-uid: 123456, children: " " }

My current regex and callback function are as follows:

$regex = '/<(.*) (?:.*?)data-uid="([^"]*?)"(?:.*?)>(.*?)<\/\1>/';
$content = preg_replace_callback($regex, 'test', $content);

function test($arg){
    print_r($arg);
}

Does anyone know how to resolve this, so that the children are captured as a string as well?

stackminu
  • you'd be **far** better off doing this with DOM parsing; using regex for this kind of task gets complicated, and ends up being rather brittle – landru27 Jun 01 '18 at 20:39
  • [Do not parse HTML with Regex](https://stackoverflow.com/a/1732454/5827005). – GrumpyCrouton Jun 01 '18 at 20:42
  • @landru27 I tried to do this with DOMDocument as well, but failed and did not even get this far. Any suggestions for capturing the tagName, data-uid, and children in an efficient way? – stackminu Jun 01 '18 at 20:43
  • @stackminu : if you have fully researched, tried, and failed with DOM parsing, you'd be far better off posting a SO question detailing what is not working with your DOM parsing, rather than giving up, switching to regex, failing there too, and posting to SO about your regex attempts; in other words, go back to DOM parsing; future you will thank you greatly – landru27 Jun 01 '18 at 20:54

1 Answer

As others have said, use a DOM parser with XPath expressions instead.
The following expression

$items = $xpath->query("//*[@data-uid]");

queries the DOM for every element that has a data-uid attribute and returns a DOMNodeList. You can then call getAttribute() on each item.


In PHP:
<?php

$data = <<<DATA
<div class="testClassOne" data-uid="123456">
    <div class="testClassTwo">Content</div>
    <!-- More nodes -->
</div>
DATA;

$dom = new DOMDocument();

# suppress warnings
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();

# set up an xpath expression
$xpath = new DOMXPath($dom);
$items = $xpath->query("//*[@data-uid]");

foreach ($items as $item) {
    echo "tagname: " . $item->tagName . "\n";
    echo "uid: " . $item->getAttribute("data-uid") . "\n";
    # getElementsByTagName('*') returns every descendant element, not only direct children
    foreach ($item->getElementsByTagName('*') as $child) {
        print_r($child);
    }
}

?>


This yields:
tagname: div
uid: 123456
DOMElement Object
(
    [tagName] => div
    [schemaTypeInfo] => 
    [nodeName] => div
    [nodeValue] => Content
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] => (object value omitted)
    [lastChild] => (object value omitted)
    [previousSibling] => (object value omitted)
    [nextSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => div
    [baseURI] => 
    [textContent] => Content
)
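
The question asks for the children as a string, whereas print_r() dumps each child as an object. One possible way to get a string, sketched here under the assumption that serializing each child node with DOMDocument::saveHTML() is acceptable, is to concatenate the serialized child nodes. The helper name collectUidElements and the array keys are hypothetical and only mirror the result shape from the question; the second sample element (the span) is also made up to show an empty match.

```php
<?php

// Hypothetical helper (not part of the original answer): returns one array
// per element that carries a data-uid attribute.
function collectUidElements(string $html): array
{
    $dom = new DOMDocument();

    // Suppress warnings from imperfect markup, as in the code above.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $results = [];

    foreach ($xpath->query('//*[@data-uid]') as $item) {
        // Serialize every direct child node (elements, text, comments)
        // and concatenate the pieces to obtain the inner HTML.
        $children = '';
        foreach ($item->childNodes as $child) {
            $children .= $dom->saveHTML($child);
        }

        $results[] = [
            'tag'      => $item->tagName,
            'data-uid' => $item->getAttribute('data-uid'),
            'children' => trim($children), // drop surrounding whitespace
        ];
    }

    return $results;
}

$html = <<<HTML
<div class="testClassOne" data-uid="123456">
    <div class="testClassTwo">Content</div>
</div>
<span data-uid="789"></span>
HTML;

$results = collectUidElements($html);
print_r($results);
```

Note that the XPath query also matches elements with data-uid that are nested inside another match, so a nested element would appear both inside its parent's children string and as its own entry.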
Jan