Thanks to mickmackusa for the links, which helped me put the pieces of the puzzle together.
Using them, I was able to get the content that follows each tab opening paragraph, which looks like this:
<p>{tabs=newtab}</p>
My process was to clean the HTML with tidy, then load it into a new DOMDocument.
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('E:\Dropbox\laragon\www\scrape\description.txt');
$config = array(
    'indent' => true,
    'output-xhtml' => true,
    'show-body-only' => true,
    'drop-empty-paras' => true,
    'wrap' => 1200
);
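// Clean up the markup with Tidy so the parser gets well-formed XHTML
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
// Load the repaired markup into a DOMDocument and wrap it in a Crawler
$doc = new DOMDocument;
$doc->loadHTML($tidy->value);
$crawler = new Crawler($doc);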
The closing marker for the whole set of tabs is
<p>{/tabs}</p>
This does not match the {tab=...} pattern my code looks for, so the last tab has no following marker to end at and needs some additional processing. As this is a one-off project, I used a quick fix.
I crawled the page and inserted a new paragraph element just before the closing {/tabs} paragraph. The code looks for a paragraph whose text contains /tabs and inserts, in effect, a new tab section with no content in front of it.
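$crawler
    ->filterXPath('//p[text()[contains(.,"/tabs")]]')
    ->each(function (Crawler $closing) use ($doc) {
        // Each Crawler wraps one matching <p>; iterate to reach the underlying DOMNode
        foreach ($closing as $node) {
            // Dummy marker so the last real tab has an end boundary
            $marker = $doc->createElement('p', '{tab=end}');
            $node->parentNode->insertBefore($marker, $node);
        }
    });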
This results in the HTML
<p>{tab=end}</p>
<p>{/tabs}</p>
Now I take the edited HTML provided by $crawler->html() and look for each tab section (starting with <p>{tab=TABNAME}</p> and ending at <p>{tab=NEXTTABNAME}</p>).
I first get the headings
$tab_headings = $crawler->filterXPath('//p[text()[contains(.,"tab=")]]')->each(function (Crawler $node) {
    // Capture everything between "{tab=" and the next closing brace
    if (preg_match('/\{tab=([^}]*)\}/', $node->text(), $matches)) {
        return $matches[1];
    }
    // No match: return null instead of an undefined variable
    return null;
});
I remove the last one (the dummy one I added)
array_pop($tab_headings);
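For anyone wanting to try the heading step in isolation, here is a minimal sketch on plain strings. The sample paragraph texts are made up, and the [^}]+ pattern is my assumption of a safe way to stop at the first closing brace.

```php
<?php
// Hypothetical paragraph texts; the real ones come from the crawled <p> nodes.
$paragraphs = ['{tab=Overview}', '{tab=Technical Specs}', '{tab=end}'];

$headings = [];
foreach ($paragraphs as $text) {
    // [^}]+ stops at the first closing brace, so nothing after it is captured.
    if (preg_match('/\{tab=([^}]+)\}/', $text, $matches)) {
        $headings[] = $matches[1];
    }
}

// Drop the dummy "end" marker, which only exists to close the last section.
array_pop($headings);

print_r($headings);
```

Only the real tab names remain after the array_pop() call.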
I can now loop through the headings and extract the HTML of each section. I am using Laravel, hence the use of dump().
// Note: [N] here counts among siblings, so this assumes all tab markers share one parent
$marker = '//p[text()[contains(.,"tab=")]]';
foreach ($tab_headings as $index => $tab) {
    dump($tab);
    // XPath positions are 1-based: $first is this tab's marker, $next the following one
    $first = $index + 1;
    $next = $index + 2;
    $until = "{$marker}[{$next}]/preceding-sibling::*";
    /**
     * Get content between tabs: keep the siblings after marker $first
     * that are also among the siblings before marker $next
     */
    $tab_content = $crawler
        ->filterXPath("{$marker}[{$first}]/following-sibling::*[count(.|{$until}) = count({$until})]")
        ->each(function (Crawler $node) {
            return $node->outerHtml();
        });
    dump($tab_content);
}
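To see why the count() comparison works, here is a minimal stand-alone sketch using plain DOMDocument and DOMXPath rather than the Crawler. The sample markup and the sectionBetween() helper are hypothetical, and the marker expression is wrapped in parentheses so that [n] picks the nth marker in document order. A node is kept only if adding it to the set of siblings before the next marker does not change the size of that set, which means it was already in the set.

```php
<?php
// Hypothetical sample markup standing in for the cleaned description.
$html = '<body><p>{tab=One}</p><p>A1</p><ul><li>A2</li></ul>'
      . '<p>{tab=Two}</p><p>B1</p><p>{tab=end}</p><p>{/tabs}</p></body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Matches the tab marker paragraphs but not the closing {/tabs} paragraph.
$marker = '//p[contains(text(), "tab=")]';

/**
 * Return the outer HTML of every element between marker $n and marker $n + 1.
 * Hypothetical helper, not part of any library.
 */
function sectionBetween(DOMXPath $xpath, string $marker, int $n): array
{
    $after = sprintf('(%s)[%d]/following-sibling::*', $marker, $n);
    $before = sprintf('(%s)[%d]/preceding-sibling::*', $marker, $n + 1);

    // Intersection of the two node sets: count(. | B) = count(B) is true
    // only when the current node is already a member of B.
    $query = sprintf('%s[count(. | %s) = count(%s)]', $after, $before, $before);

    $parts = [];
    foreach ($xpath->query($query) as $node) {
        $parts[] = $node->ownerDocument->saveHTML($node);
    }
    return $parts;
}

print_r(sectionBetween($xpath, $marker, 1));
print_r(sectionBetween($xpath, $marker, 2));
```

The dummy {tab=end} paragraph is what lets the second call find an end boundary for the last real tab.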
From here I insert the headings and their content into the database, etc.
The links that helped the most
XPath select all elements between two specific elements
XPath: how to select following siblings until a certain sibling