[TL;DR] I need to parse HTML with PHP to extract the tab names and the content of each tab.

I am migrating data from a Joomla / HikaShop site that was exported to a CSV file. Each tab is marked by a {tab=NAME} placeholder inside a <p> tag, as follows:

<p> </p>
<p style="text-align: center;"><span style="text-decoration: underline;"><span style="font-size: 14pt;"><strong>Strong Item</strong></span></span></p>
<p> {tab=Description}</p>
<p>This is a default description</p>
<ul>
<li>It has</li>
<li>mixed content</li>
</ul>
<p>{tab=Features} </p>
<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>
<p>{/tabs}</p>

I need to extract the tab name followed by the content.

I can pull out the tab markers easily enough:

$crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) { /* ... */ });

What is throwing me is getting the content between the tabs. The output I am after is:

Description =

<p>This is a default description</p>
<ul>
<li>It has</li>
<li>mixed content</li>
</ul>

Features =

<ul style="list-style-type: circle;">
<li>It's good</li>
<li>I like it</li>
</ul>
<p>It does what I want</p>
<p> </p>

Obviously I could use a regex and loop through the lines, but that is error-prone.

Thanks

Darren
  • This is an example product description I want to populate a MySQL database with the field name and the content of the tab – Darren Oct 19 '21 at 21:49
  • I don't really know what is confusing about it, the content returned is the html between the tabs, the tab name is the tab=XXX – Darren Oct 19 '21 at 22:30
  • Are the "tabs" markers always on the highest level in the document? or might they be nested in a lower level? – mickmackusa Oct 19 '21 at 22:43
  • I think you'll need to bake in a few extra pieces of logic, but this looks like the way forward: https://stackoverflow.com/q/23860883/2943403 and https://stackoverflow.com/q/10859703/2943403 – mickmackusa Oct 19 '21 at 23:27
  • Thanks, one of the links is helpful and almost does what I need. The last element is a problem but if I manipulate the html before being passed, it should work fine. Will code something later and see how it works on real world data – Darren Oct 20 '21 at 14:18

1 Answer

Thanks to mickmackusa for the links, which helped put the pieces of the puzzle together.

Using the links, I was able to get the content between one tab opening marker and the next:

<p>{tab=newtab}</p>

My process was to clean the HTML with tidy, then load it into a new DOMDocument.

use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('E:\Dropbox\laragon\www\scrape\description.txt');
$config = array(
    'indent'         => true,
    'output-xhtml'   => true,
    'show-body-only' => true,
    'drop-empty-paras' => true,
    'wrap'           => 1200
);

// Tidy
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

$doc = new DOMDocument;
$doc->loadHTML($tidy->value);



$crawler = new Crawler($doc);
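
If the tidy extension is not available, a rough fallback is to let DOMDocument parse the fragment directly. This is only a minimal sketch, assuming the export is UTF-8. The sample string is made up, and this skips tidy's clean-up (for example, it does not drop empty paragraphs):

```php
<?php
// Hypothetical sample input; the real HTML would come from the CSV export.
$html = '<p>{tab=Description}</p><p>Café menu</p><p>{/tabs}</p>';

// Collect libxml parse warnings instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// The meta tag asks libxml to read the fragment as UTF-8 (assumption: UTF-8 export).
$doc->loadHTML(
    '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
    . $html
    . '</body></html>'
);
libxml_clear_errors();

// The fragment's top-level elements are now children of <body>.
$body = $doc->getElementsByTagName('body')->item(0);
echo $body->childNodes->length, PHP_EOL;
```

The resulting $doc can be handed to new Crawler($doc) in the same way as the tidy-cleaned document.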

The tab set is closed by:

<p>{/tabs}</p>

This marker does not contain "tab=", so the XPath I had did not match it and the last tab had no end point. As this is a one-off project, I used a quick fix.

I crawled the page, looked for a paragraph containing /tabs, and inserted a new paragraph element just before it. In effect this adds a final, empty tab section that marks where the last real tab ends.

$crawler
    ->filterXpath('//p[text()[contains(.,"/tabs")]]')
    ->each(function (Crawler $closing) use ($doc) {
        foreach ($closing as $node) {
            // Insert a dummy {tab=end} marker just before <p>{/tabs}</p>
            $marker = $doc->createElement('p', '{tab=end}');
            $node->parentNode->insertBefore($marker, $node);
        }
    });

This results in the following HTML:

<p>{tab=end}</p>
<p>{/tabs}</p>

Next, I take the edited HTML from $crawler->html() and look for each tab section, which starts at <p>{tab=TABNAME}</p> and ends at <p>{tab=NEXTTABNAME}</p>.

I first get the headings:

$tab_headings = $crawler->filterXpath('//p[text()[contains(.,"tab=")]]')->each(function ($node) {
    // Capture the name between "{tab=" and the first closing brace
    $pattern = '/\{tab=([^}]*)\}/';

    if (preg_match($pattern, $node->text(), $matches)) {
        return trim($matches[1]);
    }

    // No match: return null instead of an undefined variable
    return null;
});
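
One detail worth checking in the heading pattern: a greedy (.*) runs to the last closing brace in the text, so it over-captures if two markers ever end up in the same paragraph. A character class that stops at the first closing brace avoids that. A small sketch with a made-up input string:

```php
<?php
// Hypothetical text in which two markers share one paragraph.
$text = '{tab=Description} {tab=Features}';

// Greedy: .* extends to the LAST closing brace in the string.
preg_match('/\{tab=(.*)\}/', $text, $greedy);

// Bounded: [^}]* stops at the FIRST closing brace.
preg_match('/\{tab=([^}]*)\}/', $text, $bounded);

echo $greedy[1], PHP_EOL;  // Description} {tab=Features
echo $bounded[1], PHP_EOL; // Description
```

With one marker per paragraph, as in the sample data, both patterns return the same name.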

I then remove the last heading (the dummy one I added):

array_pop($tab_headings);

I can now loop through the headings and extract the HTML for each one. I am using Laravel, hence the use of dump().

// Every <p> that holds a {tab=...} marker
$marker = '//p[text()[contains(.,"tab=")]]';

foreach ($tab_headings as $index => $tab) {
    dump($tab);

    // XPath positions are 1-based: the current marker and the one after it
    $first = $index + 1;
    $next = $index + 2;

    /**
     * Get content between tabs: keep the siblings that follow the current
     * marker and are also preceding siblings of the next marker
     */
    $tab_content = $crawler
        ->filterXpath($marker . '[' . $first . ']/following-sibling::*
        [
        count(.|' . $marker . '[' . $next . ']/preceding-sibling::*)
        =
        count(' . $marker . '[' . $next . ']/preceding-sibling::*)
        ]')
        ->each(function ($node) {
            return $node->outerHtml();
        });

    dump($tab_content);
}
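
For comparison, the same grouping can be sketched as a single pass over the top-level elements, which needs neither the dummy paragraph nor positional XPath. This is an alternative technique, not the approach used above. It uses PHP's built-in DOMDocument rather than DomCrawler, the sample HTML is made up to mirror the question, and it assumes every tab marker is a direct child of the body:

```php
<?php
// Hypothetical sample mirroring the structure in the question.
$html = <<<'HTML'
<p style="text-align: center;"><strong>Strong Item</strong></p>
<p> {tab=Description}</p>
<p>This is a default description</p>
<ul><li>It has</li><li>mixed content</li></ul>
<p>{tab=Features} </p>
<ul><li>It's good</li></ul>
<p>It does what I want</p>
<p>{/tabs}</p>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<html><body>' . $html . '</body></html>');
libxml_clear_errors();

$tabs = [];      // tab name => list of top-level DOMElement nodes
$current = null; // name of the tab being filled, null outside any tab

$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $node) {
    if (!$node instanceof DOMElement) {
        continue; // skip whitespace text nodes between elements
    }
    $isParagraph = $node->nodeName === 'p';
    if ($isParagraph && preg_match('/\{tab=([^}]*)\}/', $node->textContent, $m)) {
        // Opening marker: start collecting under this tab name
        $current = trim($m[1]);
        $tabs[$current] = [];
    } elseif ($isParagraph && strpos($node->textContent, '{/tabs}') !== false) {
        // Closing marker: stop collecting
        $current = null;
    } elseif ($current !== null) {
        $tabs[$current][] = $node;
    }
}

foreach ($tabs as $name => $nodes) {
    echo $name, ': ', count($nodes), ' element(s)', PHP_EOL;
}
```

Each collected node can be turned back into markup with $doc->saveHTML($node). Content that appears before the first marker, such as the centred title, is ignored.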

From here I insert the results into the database.

The links that helped the most:

XPath select all elements between two specific elements

XPath: how to select following siblings until a certain sibling

Darren