I'm trying to use PHP to extract the content after each h2 tag (and before the next h2 tag).

Example:

$content = '<h2>title 1</h2>
<ul>
<li>test</li>
<li>test</li>
<li>test</li>
</ul>
<h2>title 2</h2>
<p>testing only</p>
<p>testing only</p>
<p>testing only</p>
<h2>title 3</h2>
<p>testing only</p>
<p>testing only</p>';

To become

[0] => <ul>
<li>test</li>
<li>test</li>
<li>test</li>
</ul>

[1] => <p>testing only</p>
<p>testing only</p>
<p>testing only</p>

[2] => <p>testing only</p>
<p>testing only</p>

I have tried so many different things, too many to list here. I only want to extract the content between the h2 tags, not the h2 tags themselves.

If anyone could please point me in the right direction, or help me out, that would be greatly appreciated!

Thank you.

SoulieBaby
  • [PHP Simple HTML DOM Parser](https://simplehtmldom.sourceforge.io/) – Cid Sep 10 '20 at 07:47
  • Agree with @Cid, Simple HTML DOM Parser is a user-friendly tool that should help you get the job done. You can also use Composer to get the latest available version: `composer require simplehtmldom/simplehtmldom dev-master`. Also check out [how-to-parse-html-in-php](https://stackoverflow.com/questions/18349130/how-to-parse-html-in-php) – jibsteroos Sep 10 '20 at 07:57
  • Thank you for your help, I'll look into that – SoulieBaby Sep 10 '20 at 08:02
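
For readers who would rather not install a library, here is a minimal sketch of the same parser-based idea using PHP's bundled `DOMDocument` (a swapped-in technique, not the Simple HTML DOM Parser linked above). It assumes the DOM extension is enabled and that the `<h2>` tags are top-level siblings of the content, as in the question; the wrapper `<div>` and the variable names are illustrative only.

```php
<?php
$content = "<h2>title 1</h2>\n<ul><li>test</li></ul>\n<h2>title 2</h2>\n<p>testing only</p>";

// Silence libxml warnings about fragment markup, then parse the fragment
// inside a wrapper <div> so that it has a single root element.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<div>' . $content . '</div>');
libxml_clear_errors();

$sections = [];
$index = -1;
$root = $doc->getElementsByTagName('div')->item(0);

foreach ($root->childNodes as $node) {
    if ($node->nodeName === 'h2') {
        // Each heading starts a new, initially empty section.
        $sections[++$index] = '';
    } elseif ($index >= 0) {
        // Everything after a heading is appended to the current section.
        $sections[$index] .= $doc->saveHTML($node);
    }
}

// Drop the whitespace left over around each section.
$sections = array_map('trim', $sections);

print_r($sections);
```

Unlike a regular expression, this approach should keep working if a heading carries attributes (for example `<h2 class="...">`), because it compares element names rather than raw text.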

2 Answers

Try this one :)

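<?php

    $content = "your content";

    // Capture everything between a closing </h2> and the next <h2>
    // (or the end of the string, matched by \z).
    preg_match_all('/(?:<\/h2>)(.*?)(?:<h2>|\z)/s', $content, $match);

    // $match[0] holds the full matches, $match[1] the captured content.
    var_dump($match);
?>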

Demo -> https://www.phpliveregex.com/p/x7j (select, preg_match_all)
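
To get the single flat array shown in the question, one option is to read only the first capture group and trim it. The following is a sketch under the assumption that every heading is a bare `<h2>` without attributes; the shortened sample string and the `$sections` name are illustrative.

```php
<?php
$content = "<h2>title 1</h2>\n<ul>\n<li>test</li>\n</ul>\n"
         . "<h2>title 2</h2>\n<p>testing only</p>\n"
         . "<h2>title 3</h2>\n<p>last</p>";

// Same pattern as above: group 1 is the text between a closing </h2>
// and the next <h2> (or the end of the string).
preg_match_all('/(?:<\/h2>)(.*?)(?:<h2>|\z)/s', $content, $match);

// $match[1] contains only the captured sections; trim() removes the
// newlines that surround each one.
$sections = array_map('trim', $match[1]);

print_r($sections);
```

Here `$sections` ends up with three entries, one per heading, and none of them contains an `<h2>` tag.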

Edit

Note, in case you are wondering why the match result is a multidimensional array:

  • $matches[0] is an array of full pattern matches

  • $matches[1] is an array of strings matched by the first parenthesized subpattern.

  • $matches[2] is an array of strings matched by the second parenthesized subpattern.

  • (...), and so on

To check whether preg_match_all was successful, inspect $match[0] before you proceed. To inspect your match groups, use the matching index, e.g. $1 -> $match[1], $2 -> $match[2], $3 -> $match[3], and so on.

If the pattern matches multiple times, each match group will contain more than one result.

Example: single match

https://phpsandbox.io/n/icy-term-9wp6

<?php
    $test_string = "Your task XX-123";

    preg_match_all('/task (([A-Z]{1,2})-([0-9]{1,}))/s', $test_string, $match);

    // destructuring the array is equivalent to selecting by index: $match[$index]
    [$full_match, $match_group_1, $match_group_2, $match_group_3] = $match;

    var_dump($full_match);    // -> ["task XX-123"]
    var_dump($match_group_1); // -> ["XX-123"]
    var_dump($match_group_2); // -> ["XX"]
    var_dump($match_group_3); // -> ["123"]
?>

Example: multiple match

https://phpsandbox.io/n/shy-credit-0ng6

<?php
    $test_string = "Your task XX-123, Your task YZ-456, Your task CD-789";

    preg_match_all('/task (([A-Z]{1,2})-([0-9]{1,}))/s', $test_string, $match);

    // destructuring the array is equivalent to selecting by index: $match[$index]
    [$full_match, $match_group_1, $match_group_2, $match_group_3] = $match;

    var_dump($full_match);    // -> ["task XX-123", "task YZ-456", "task CD-789"]
    var_dump($match_group_1); // -> ["XX-123", "YZ-456", "CD-789"]
    var_dump($match_group_2); // -> ["XX", "YZ", "CD"]
    var_dump($match_group_3); // -> ["123", "456", "789"]
?>

Example: handle error

https://phpsandbox.io/n/bitter-morning-55gn

<?php

    $test_string = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy";

    preg_match_all('/(no)-(matching)-(pattern)/s', $test_string, $match);

    // get your defined match groups
    $full_match = $match[0];
    $match_group_1 = $match[1];
    $match_group_2 = $match[2];
    $match_group_3 = $match[3];

    // check if your match was successful
    if (empty($full_match)) {
        // handle error
        print("could not match any result");
        var_dump($match);
    }
    // handle success
    else {
        // single quotes keep $match literal; double quotes would try to interpolate the array
        print('matched something, check $match values for more details');
        var_dump($match_group_1, $match_group_2, $match_group_3);
    }

?>

See php.net docs -> https://www.php.net/manual/en/function.preg-match-all.php
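
If the per-group layout is inconvenient, the documented `PREG_SET_ORDER` flag groups the results per match instead. A small sketch, reusing the task pattern from the examples above (the variable names `$sets`, `$id`, `$prefix` and `$number` are illustrative):

```php
<?php
$test_string = "Your task XX-123, Your task YZ-456";

// With PREG_SET_ORDER, each element of $sets is one complete match:
// [full match, group 1, group 2, group 3].
preg_match_all('/task (([A-Z]{1,2})-([0-9]+))/', $test_string, $sets, PREG_SET_ORDER);

foreach ($sets as [$full, $id, $prefix, $number]) {
    echo "$id => prefix $prefix, number $number\n";
}
```

With this flag, `$sets[0]` is the whole first match with its groups, which is often easier to loop over than parallel arrays.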

Amin Zoubaa
  • Thank you, I was so close on this one, but couldn't get it working properly! – SoulieBaby Sep 10 '20 at 09:03
  • It doesn't work properly. See https://phpsandbox.io/n/square-cell-pypp (Run Code). – AbsoluteBeginner Sep 10 '20 at 12:18
  • @AbsoluteBeginner what is wrong? I can't confirm your concern, the result is accurate. ¯\_(ツ)_/¯ – Amin Zoubaa Sep 10 '20 at 15:58
  • Your regex returns two arrays (why?), while in his question OP expects just one (the second). I think this is not accurate. – AbsoluteBeginner Sep 10 '20 at 16:06
  • @AbsoluteBeginner that is also correct. The `preg_match_all` function has five parameters that can be passed (`$pattern`, `$subject`, `$matches`, `$flags`, `$offset`). The interesting part here is `$flags`: if no order flag is given, `PREG_PATTERN_ORDER` is assumed. You always get an array: `$matches[0]` is an array of full pattern matches, `$matches[1]` is an array of strings matched by the first parenthesized subpattern, and so on. See: https://www.php.net/manual/en/function.preg-match-all.php – Amin Zoubaa Sep 10 '20 at 18:45
  • @AbsoluteBeginner, see my edit, maybe this could solve your why question? – Amin Zoubaa Sep 10 '20 at 19:28

This is my suggestion:

$content = '<h2>title 1</h2>
<ul>
<li>test</li>
<li>test</li>
<li>test</li>
</ul>
<h2>title 2</h2>
<p>testing only</p>
<p>testing only</p>
<p>testing only</p>
<h2>title 3</h2>
<p>testing only</p>
<p>testing only</p>';

$content = preg_replace('/<h2>(.*?)<\/h2>/s', '|', $content);
$content = explode('|', $content);
$content = array_map('trim', array_values(array_filter($content)));

// var_dump($content);

It returns just one array, as requested by OP.

I'm sure it can be improved. But I think it's a good starting point.
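
One possible refinement, offered as a sketch rather than a definitive version: `preg_split` can split directly on the headings, so no placeholder character is needed and content containing `|` stays intact. It still assumes bare `<h2>` tags without attributes; the shortened sample string and the `$parts` and `$sections` names are illustrative.

```php
<?php
$content = "<h2>title 1</h2>\n<ul>\n<li>test</li>\n</ul>\n"
         . "<h2>title 2</h2>\n<p>a | b</p>";

// Split on every <h2>...</h2> heading; PREG_SPLIT_NO_EMPTY drops the
// empty piece in front of the first heading.
$parts = preg_split('/<h2>.*?<\/h2>/s', $content, -1, PREG_SPLIT_NO_EMPTY);

// Trim each piece, discard pieces that were whitespace only, and reindex.
$sections = array_values(array_filter(array_map('trim', $parts), 'strlen'));

print_r($sections);
```

This also leaves the original `$content` string untouched, since the result is stored in a separate variable.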

AbsoluteBeginner
  • It is too complex. The source `$content` is manipulated three times and its type changes from a string to an array. You lose your original content, and input values should not be manipulated; maybe this could be an argument in a method. Also critical is that the symbol `|` can break the output when a pipe is used inside the content. Another performance issue can occur when you have big strings that contain much more than a test string; maybe they would cause slow performance when calling 6 functions: `preg_replace -> explode -> array_filter -> array_values -> array_map -> trim` – Amin Zoubaa Sep 10 '20 at 21:56
  • I agree with you that my solution could seem "too complex" (as I wrote, it can be improved; are you able to do it?). But it gives exactly what OP requested (read his question again). The symbol `|` can be replaced with any other symbol. Your other objections are purely hypothetical ("can occur ... maybe they would cause slow performance ..."). No offense, but I do not think your solution is valid. Despite this, I did not downvote it. – AbsoluteBeginner Sep 10 '20 at 22:12