I am trying to scrape a bit of HTML, and the structure comes out like this.

//blockquote

<h2>1. text text</h2>
<p>1. paragraph paragraph</p>
<h2>2. text text</h2>
<p>2. paragraph paragraph</p>
<h2>3. text text</h2>
<p>3a. paragraph paragraph</p>
<p>3b. paragraph paragraph</p>
<h2>4. text text</h2>
<p>4. paragraph paragraph</p>

Initially I was splitting on the paragraph tags, but I noticed that some blocks have more than one paragraph. At this point I am unsure how to adjust the explode() call I had in place.

$paras = explode("<p>", $paras);

So the final array needs to look more like this.

array(
"<p>1. paragraph paragraph</p>",
"<p>2. paragraph paragraph</p>",
"<p>3a. paragraph paragraph</p><p>3b. paragraph paragraph</p>",
"<p>4. paragraph paragraph</p>"
);

This is how the code currently looks:

foreach ($lookuphtml->find('blockquote') as $text) {
    $paras = $text->innertext;
    $paras = explode("<p>", $paras);
}

//the actual contents look like this

<blockquote><h2 class="left">History</h2><p>Opened October 1997 as the first brewery in Bath since 1956.  The brewery is located in an outbuilding behind Ye Old Farmhouse public house.</p><h2 class="left">Beers Brewed</h2><p>We do not maintain a list of beers brewed by each brewery.  There may be a list on the brewery's own website and we suggest you also visit the entry for  Abbey Ales Ltd on the independent <a href="http://www.beermad.org.uk/brewery/2" rel="external" target="_blank">www.beermad.org.uk</a>.</p><h2 class="left">Regular Outlets</h2><p>The brewery has 4 pubs :</p><p>The Star, 23 Vineyards, Bath, BA1 5NA <br>The Coeur de Lion, Northumberland Place, Bath, BA1 5AR<br>The Foresters, 58 Goose Street, Beckington, Frome, BA11 6SS<br>The Assembly, 16-17 Alfred Street, Bath, BA1 2QU</p><h2 class="left">Visit Information</h2><p>Information on visit availability can be found on the breweries web site.</p><h2 class="left">Brewery Shop Information</h2><p>The brewery does not have a shop, but sells a variety of items via it's web site.</p></blockquote>

...Answer

Never mind, guys. Here is the solution.

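foreach ($lookuphtml->find('blockquote') as $text) {
    $paras = $text->innertext;

    // Replace every heading with a marker, then split the HTML on that marker.
    $paras = preg_replace("/<h2 class=\"left\">(.*?)<\/h2>/", "#~", $paras);
    $pa = explode("#~", $paras);
    // array_splice() returns the removed elements: everything after the
    // empty piece that precedes the first heading, one entry per section.
    $pa2 = array_splice($pa, 1);
}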
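
For anyone who wants to try the idea without the scraper, here is a minimal standalone sketch of the same split done in one step with preg_split(). The helper name splitByHeadings is hypothetical, the sample markup is the simplified example from the top of the question, and the pattern is loosened to match any h2 attributes, which is an assumption beyond the class="left" case above.

```php
<?php
// Hypothetical helper: one string per <h2> section, headings discarded.
function splitByHeadings(string $html): array
{
    // Split on every <h2 ...>...</h2>; "s" lets "." span newlines, "i" ignores case.
    $parts = preg_split('/<h2\b[^>]*>.*?<\/h2>/is', $html) ?: [];

    // Drop whatever precedes the first heading and trim each section.
    return array_map('trim', array_slice($parts, 1));
}

$html = '<h2>1. text text</h2><p>1. paragraph paragraph</p>'
      . '<h2>2. text text</h2><p>2. paragraph paragraph</p>'
      . '<h2>3. text text</h2><p>3a. paragraph paragraph</p><p>3b. paragraph paragraph</p>'
      . '<h2>4. text text</h2><p>4. paragraph paragraph</p>';

$sections = splitByHeadings($html);
print_r($sections);
```

Because the split happens on the headings rather than on the paragraphs, a section with two or more paragraphs stays together in a single array entry.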
The Old County

1 Answer

Use SimpleXML:

$string = <<<XML
<root>
<h2>1. text text</h2>
<p>1. paragraph paragraph</p>
<h2>2. text text</h2>
<p>2. paragraph paragraph</p>
<h2>3. text text</h2>
<p>3a. paragraph paragraph</p>
<p>3b. paragraph paragraph</p>
<h2>4. text text</h2>
<p>4. paragraph paragraph</p>
</root>
XML;

$xml = simplexml_load_string($string);
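$result = [];
// Iterate over the <p> children directly; casting each one to string yields its text.
foreach ($xml->p as $p) {
    $item = (string) $p;
    // Group by the first number in the text, falling back to the full text if there is none.
    preg_match('/(\d+)/', $item, $matches);
    $number = isset($matches[0]) ? $matches[0] : $item;
    $result[$number] = isset($result[$number]) ? $result[$number] : '';
    $result[$number] .= '<p>' . $item . '</p>';
}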

print_r(array_values($result));

Result is:

php > print_r(array_values($result));
Array
(
    [0] => <p>1. paragraph paragraph</p>
    [1] => <p>2. paragraph paragraph</p>
    [2] => <p>3a. paragraph paragraph</p><p>3b. paragraph paragraph</p>
    [3] => <p>4. paragraph paragraph</p>
)
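
The grouping above depends on the numbers inside the text. As a side sketch (not part of the answer as posted), the same grouping can be done by position instead, attaching each p to the most recent h2 with DOMDocument and DOMXPath. The helper name groupParagraphsByHeading is hypothetical, and the sketch assumes the h2 and p elements are the only ones that matter in the fragment.

```php
<?php
// Hypothetical helper: concatenate the <p> elements that follow each <h2>.
function groupParagraphsByHeading(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc->loadHTML($html);              // the fragment is wrapped in <html><body>
    libxml_clear_errors();

    $groups = [];
    $xpath = new DOMXPath($doc);
    // A union query returns the matching nodes in document order.
    foreach ($xpath->query('//h2 | //p') as $node) {
        if ($node->nodeName === 'h2') {
            $groups[] = '';             // each heading starts a new group
        } elseif ($groups) {
            // Append the paragraph's markup to the group of the latest heading.
            $groups[count($groups) - 1] .= $doc->saveHTML($node);
        }
    }
    return $groups;
}

$html = '<h2>1. text text</h2><p>1. paragraph paragraph</p>'
      . '<h2>2. text text</h2><p>2. paragraph paragraph</p>'
      . '<h2>3. text text</h2><p>3a. paragraph paragraph</p><p>3b. paragraph paragraph</p>'
      . '<h2>4. text text</h2><p>4. paragraph paragraph</p>';

$groups = groupParagraphsByHeading($html);
print_r($groups);
```

Paragraphs that appear before the first heading are skipped here; whether that is the right choice depends on the page being scraped.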
Nick
  • I get that result, and that is the problem, man. I need both "3." paragraphs in [2] ONLY, so: [2] => 3. paragraph paragraph 3. paragraph paragraph [3] => 4. paragraph paragraph – The Old County Aug 22 '16 at 09:41
  • Edited. Added `array_unique()` – Nick Aug 22 '16 at 09:45
  • that is still missing the TWO paragraphs – The Old County Aug 22 '16 at 09:45
  • I need the final output here as follows -- Array ( [0] => "1. paragraph paragraph" [1] => "2. paragraph paragraph" [2] => "3. paragraph paragraph, 3. paragraph paragraph" [3] => "4. paragraph paragraph" ) – The Old County Aug 22 '16 at 09:46
  • your solution is still not right – The Old County Aug 22 '16 at 09:53
  • ok, just a moment :) – Nick Aug 22 '16 at 09:56
  • That won't work, man. We are assuming the contents of the paragraphs are the same, so the paragraph contents for 3 could be like this: "3. header", "3a. I like to walk amongst the trees. paragraph paragraph", "3b. I want to build a fire. paragraph paragraph" – The Old County Aug 22 '16 at 10:04
  • ^ Your code will fall down in this instance; you are assuming the "contents" will be the same. – The Old County Aug 22 '16 at 10:05
  • The numbers in the paragraphs and headers are only an example. This is the problem, man: we can't rely on the contents being unique, or on there being numbers in the pattern. I am going to edit the first block, then you can see what the contents look like for real. – The Old County Aug 22 '16 at 10:14
  • foreach($lookuphtml->find('blockquote') as $text) { $paras = $text->innertext; $paras = preg_replace("/<h2 class=\"left\">(.*?)<\/h2>/", "#~", $paras); $pa = explode("#~", $paras); $pa2 = array_splice($pa, 1); } -- solution man – The Old County Aug 22 '16 at 10:36