I have some HTML files that contain the same tags with different strings between them. I want to get the string from a specific tag, and once the first match is found, only that string should be added to the array. See the code below for details.

The html:

<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <h1>Some Text</h1>
    <p>This is the first Paragraph</p>
    <ul>
      <li></li>
      <li></l1>
    </ul>
    <p>This is the second Pharagraph</p>
  </body>
</html>

The real HTML files contain more elements than this.

I want to get the text inside the first <p> only, without wasting time searching the whole HTML file when I need just one value from a specific tag.

The PHP:

//Loop over all the HTML files inside a folder
$files = glob("files/*.html");
foreach($files as $file){ 
    //Get the whole content of each HTML file
    $content = file_get_contents($file);
    //Search for a specific tag (the pattern needs a closing # delimiter)
    preg_match_all('#<p>(.*?)</p>#', $content, $matches);
}

I only want the value of the first match to be added to $matches.

I can't edit the HTML to add a class or id to the tags I want to read, because I didn't create the files and I can't edit all of them manually.

I don't mind using another approach to get these values, as long as it does what I want: take only the first match and then stop searching the file.

tommy
  • Use `preg_match`? Or, better, DOM? – revo Nov 30 '17 at 18:01
  • What do you think the "all" in `preg_match_all` stands for ...? Might it have a counterpart without that ...? – CBroe Nov 30 '17 at 18:02
  • [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – ctwheels Nov 30 '17 at 18:02
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – CAustin Nov 30 '17 at 18:03
  • @CAustin, how is it a duplicate? I'm asking a different question that is not only about regular expressions! – tommy Nov 30 '17 at 18:12
  • The question might be different, but it has the same answer. – CAustin Nov 30 '17 at 18:15

1 Answer

You can do this with DOMDocument.

<?php 
$html = '<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <h1>Some Text</h1>
    <p>This is the first Paragraph</p>
    <ul>
      <li></li>
      <li></l1>
    </ul>
    <p>This is the second Pharagraph</p>
  </body>
</html>';

$err = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($err);

// find all p tags, select the first, get its value
$pValue = $dom->getElementsByTagName('p')->item(0)->nodeValue;

//This is the first Paragraph
echo $pValue;

https://3v4l.org/kjFoC
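
If you would rather ask for the first `<p>` explicitly and get `null` back when a file has no paragraph at all, an XPath query is one option. This is only a sketch: the function name `firstParagraphViaXPath` is made up for illustration, and note that DOMDocument still parses the whole document before the query runs, so this is about correctness rather than skipping work.

```php
<?php
// Hypothetical helper (not from the answer above): returns the text of the
// first <p> element, or null when the document contains no <p>.
function firstParagraphViaXPath(string $html): ?string
{
    // Collect libxml warnings internally so malformed markup stays quiet,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    // (//p)[1] selects the first <p> in document order.
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('(//p)[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return trim($nodes->item(0)->textContent);
}
```

The `null` return lets the calling loop skip files without a paragraph instead of failing on them.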

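To fit this into your code, you could wrap it in a function:

<?php 
// Return the text of the first <p> in $src, or null if there is none
function getFirstParagraph($src) {
    // Suppress warnings from malformed markup, then restore the previous setting
    $err = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($src);
    libxml_clear_errors();
    libxml_use_internal_errors($err);

    // item(0) is null when the document has no <p> at all
    $p = $dom->getElementsByTagName('p')->item(0);
    return $p !== null ? $p->nodeValue : null;
}

//Loop over all the HTML files inside a folder
$matches = [];
$files = glob("files/*.html");
foreach($files as $file){ 
    //Get the whole content of each HTML file
    $content = file_get_contents($file);
    //Store only the first paragraph of this file
    $matches[] = getFirstParagraph($content);
}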
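
As the comments on the question point out, `preg_match` (without the `_all`) stops scanning at the first match, which is the closest thing to "stop searching the file" here. Below is a minimal sketch of that route. It assumes the first paragraph is a simple `<p>...</p>` pair with no nested `<p>`. The name `firstParagraphViaRegex` is made up for illustration, and regex on HTML is fragile in general, so treat it as a fallback rather than a recommendation.

```php
<?php
// Hypothetical helper: returns the text of the first <p>...</p> pair,
// or null when the pattern does not match.
function firstParagraphViaRegex(string $html): ?string
{
    // preg_match() returns after the first match.
    // 's' lets '.' span newlines; 'i' ignores tag case;
    // '\b[^>]*' allows attributes but does not match <pre>.
    if (preg_match('#<p\b[^>]*>(.*?)</p>#si', $html, $m) === 1) {
        // Drop any inline markup inside the paragraph and trim whitespace.
        return trim(strip_tags($m[1]));
    }

    return null;
}
```

In the loop above, `$matches[] = firstParagraphViaRegex($content);` would be a drop-in replacement for the DOM-based call.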
Lawrence Cherone