24

I know PHP's DOM extension can parse HTML, and I have found many related questions here on Stack Overflow, but I have a specific requirement. I have HTML content like this:

<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>

I want to parse the above HTML and save the content into two separate arrays, $heading and $content:

$heading = array('Chapter 1','Chapter 2','Chapter 3');
$content = array('This is chapter 1','This is chapter 2','This is chapter 3');

I can achieve this easily with jQuery, but I am not sure that's the right way to do it. It would be great if someone could point me in the right direction. Thanks in advance.

hatef
laradev

5 Answers

31

I used DOMDocument and DOMXPath to solve this. Here is the code:

<?php
$dom = new DOMDocument();
$test='<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>';

// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($test);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$heading = parseToArray($xpath, 'Heading1-H');
$content = parseToArray($xpath, 'Normal-H');

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

// Return the trimmed text of every <span> whose class attribute equals $class
function parseToArray($xpath, $class)
{
    $xpathquery = "//span[@class='" . $class . "']";
    $elements = $xpath->query($xpathquery);

    $resultarray = array();
    // query() returns false (not null) when the expression is invalid
    if ($elements === false) {
        return $resultarray;
    }
    foreach ($elements as $element) {
        $resultarray[] = trim($element->textContent);
    }
    return $resultarray;
}

Live result: http://saji89.codepad.org/2TyOAibZ
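
One caveat with the exact-match predicate `@class='...'` used above: it only matches when the attribute holds that single class name and nothing else. A common XPath 1.0 idiom for matching one class among several is sketched below. The helper name `spanTextsByClass` and the sample markup are made up for illustration, and the sketch assumes `$class` contains no quote characters.

```php
<?php
// Hypothetical helper (not part of the answer above): return the trimmed
// text of every <span> that carries $class as one of its class tokens.
function spanTextsByClass(DOMXPath $xpath, string $class): array
{
    // Pad the normalized class list with spaces so " name " only matches
    // a whole token, never a prefix such as "Heading1-Hx".
    $query = sprintf(
        "//span[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $texts = [];
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        return $texts; // invalid expression
    }
    foreach ($nodes as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Sample markup (an assumption): the first span has two classes.
$sample = '<p><span class="Heading1-H first">Chapter 1</span></p>'
        . '<p><span class="Normal-H">This is chapter 1</span></p>'
        . '<p><span class="Heading1-Hx">Not a heading</span></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($sample);
libxml_clear_errors();

$xpath   = new DOMXPath($dom);
$heading = spanTextsByClass($xpath, 'Heading1-H');
$content = spanTextsByClass($xpath, 'Normal-H');
```

With the markup from the question, where each span has exactly one class, both predicates select the same nodes.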

saji89
  • I've found this link to be very useful to learn the XPATH.query syntax: https://www.w3schools.com/xml/xpath_syntax.asp – Nigini Jul 15 '20 at 20:45
20

Take a look at PHP Simple HTML DOM Parser.

It has a jQuery-like selector syntax, so you can easily select any element by ID or class.

// include/require the simple html dom parser file

$html_string = '
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 1</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 1</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 2</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 2</span>
    </p>
    <p class="Heading1-P">
        <span class="Heading1-H">Chapter 3</span>
    </p>
    <p class="Normal-P">
        <span class="Normal-H">This is chapter 3</span>
    </p>';
$html = str_get_html($html_string);

// Initialize both arrays so they exist even when nothing matches
$heading = array();
$content = array();

foreach ($html->find('span') as $element) {
    if ($element->class === 'Heading1-H') {
        $heading[] = $element->innertext;
    } elseif ($element->class === 'Normal-H') {
        $content[] = $element->innertext;
    }
}
iniravpatel
Paul Denisevich
  • 3
    !!NOTICE!! not using "->innertext" leads to memory leaks. – M at Jul 14 '19 at 21:19
  • 2
    This is a much easier option and produces more readable code compared to using DomDocument. – Stephen G Feb 23 '20 at 14:53
  • Is there an option to install that with composer? – luckydonald Jun 10 '20 at 17:09
  • 1
    Composer install [is now possible](https://sourceforge.net/p/simplehtmldom/news/2019/10/composer-package/): `composer require simplehtmldom/simlehtmldom dev-master` and `use simplehtmldom\HtmlWeb;` – luckydonald Jun 10 '20 at 17:13
  • @luckydonald there is a typo in your comment. missing the "p" in the second "simple" in the composer require command – Philip Dec 22 '22 at 21:55
  • @Philip yeah, that typo is in the [linked official source](https://sourceforge.net/p/simplehtmldom/news/2019/10/composer-package/). The corrected version would then be like this: `composer require simplehtmldom/simplehtmldom dev-master` and `use simplehtmldom\HtmlWeb;` – luckydonald Jan 04 '23 at 15:01
8

Here's an alternative way to parse the HTML using DiDOM, which offers significantly better performance in both speed and memory footprint.

Install it with Composer:

composer require imangazaliev/didom

<?php

use DiDom\Document;

require_once('vendor/autoload.php');

$html = <<<HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

$document = new Document($html);

// find chapter headings
$elements = $document->find('.Heading1-H');

$headings = [];

foreach ($elements as $element) {
    $headings[] = $element->text();
}

// find chapter texts
$elements = $document->find('.Normal-H');

$chapters = [];

foreach ($elements as $element) {
    $chapters[] = $element->text();
}

echo("Headings\n");

foreach ($headings as $heading) {
    echo("- {$heading}\n");
}

echo("Chapter texts\n");

foreach ($chapters as $chapter) {
    echo("- {$chapter}\n");
}
8ctopus
  • 3
    Love it when you find an old question on SO with a really good modern answer. That DOM parser is excellent, cheers. – McNab Apr 27 '22 at 09:16
5

One option is to use DOMDocument and DOMXPath. They have a bit of a learning curve, but once you are past it, you will be pretty happy with what you can achieve.

Read the following pages on php.net:

http://php.net/manual/en/class.domdocument.php

http://php.net/manual/en/class.domxpath.php

Hope this helps.
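
As a rough illustration of what those two classes can do with the markup from the question, here is a minimal sketch. The variable names and the decision to pair each heading paragraph with the `<p class="Normal-P">` that immediately follows it are assumptions for illustration, not something the manual pages prescribe.

```php
<?php
// Markup copied from the question.
$html = '<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>';

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath   = new DOMXPath($dom);
$heading = [];
$content = [];

// Walk the heading paragraphs in document order.
foreach ($xpath->query('//p[@class="Heading1-P"]') as $p) {
    $heading[] = trim($p->textContent);

    // Relative query: the nearest following sibling <p>, kept only if it
    // is a Normal-P paragraph.
    $next = $xpath
        ->query('following-sibling::p[1][@class="Normal-P"]', $p)
        ->item(0);

    // Use an empty string when a heading has no body paragraph, so the
    // two arrays stay index-aligned.
    $content[] = $next !== null ? trim($next->textContent) : '';
}

print_r($heading);
print_r($content);
```

Passing the heading node as the second argument to `query()` makes the expression relative to that node, which keeps `$heading[$i]` and `$content[$i]` referring to the same chapter.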

Greeso
-12

// Create DOM from URL or file (file_get_html() comes from PHP Simple HTML DOM Parser)

$html = file_get_html('http://www.google.com/');

// Find all images

foreach($html->find('img') as $element) 
   echo $element->src . '<br>';

// Find all links

foreach($html->find('a') as $element) 
   echo $element->href . '<br>';
Chen-Tsu Lin
jfraber