1

I'm trying to extract the text inside the <li> </li> tags below. My regex works, but it only gives me the first <li> (Lorem ipsum...).

I'm reasonably new to regex, and I'm aware that traversing the DOM would likely be more reliable, but in this case regex is preferred. Any ideas what I need to change to get all the results instead of just the first one?

/<div class="foo-bar">[\s\S]+<ul>[\s\S]*?(<li>([\s\S]*?)<\/li>)+[\s\S]*?<\/ul>/

<div class="foo-bar">
    <!-- Other junk -->
    <ul>
        <li>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </li>
        <li>
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        </li>
        <li>
            Fusce neque lacus, feugiat eget sapien eget, ullamcorper rutrum mauris.
        </li>
        <li>
            Maecenas in ipsum consectetur, finibus ex et, condimentum turpis.
        </li>
    </ul>
    <!-- Other junk -->
</div>
Eamonn
  • What does your PHP code look like? – nerdlyist Feb 06 '17 at 16:50
  • Don't use regex. Use a parser. http://php.net/manual/en/domdocument.getelementsbytagname.php To do it with regex you'd need to pull the full `ul` then parse out each `li`. – chris85 Feb 06 '17 at 16:51
  • Doesn't exist yet, just prototyping the regex. Need to fiddle just a snippet as above. – Eamonn Feb 06 '17 at 16:51
  • See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – TomWilsonFL Feb 06 '17 at 17:02
  • @TomWilsonFL "While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML." – Eamonn Feb 06 '17 at 17:05
  • I have read it also. :) I still think it is apt for your question because you may be asking a single Regex to do too much. – TomWilsonFL Feb 06 '17 at 17:14

3 Answers

1

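Use DOM + XPath, not regex.

// $html is assumed to hold the markup from the question
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Select only the <li> children of the <ul> directly inside div.foo-bar
foreach ($xpath->evaluate('//div[@class="foo-bar"]/ul/li') as $li) {
  var_dump($li->textContent);
}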

Output:

string(80) "
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        "
string(75) "
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        "
string(95) "
            Fusce neque lacus, feugiat eget sapien eget, ullamcorper rutrum mauris.
        "
string(89) "
            Maecenas in ipsum consectetur, finibus ex et, condimentum turpis.
        "
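The dumped strings keep the indentation and line breaks from the markup. If only the sentences are wanted, each value can be trimmed while it is collected. This is a minimal sketch, assuming a shortened version of the question's markup; the `$items` array is an illustrative name, not part of the original answer.

```php
<?php
// Shortened sample markup (an assumption for this sketch).
$html = '<div class="foo-bar"><ul><li>
    Lorem ipsum dolor sit amet.
</li><li>
    Vestibulum iaculis nibh.
</li></ul></div>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$items = [];
foreach ($xpath->evaluate('//div[@class="foo-bar"]/ul/li') as $li) {
    // trim() strips the leading and trailing whitespace around each item's text
    $items[] = trim($li->textContent);
}

print_r($items);
```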
ThW
  • Great answer otherwise though. – Eamonn Feb 07 '17 at 10:25
  • I did. You might not WANT to use an XML parser, but it is the much better solution. So I posted the answer more for others that might have the same problem and find your question. – ThW Feb 07 '17 at 12:31
0

Add the global g flag at the end. For example:

/<div class="foo-bar">[\s\S]+<ul>[\s\S]*?(<li>([\s\S]*?)<\/li>)+[\s\S]*?<\/ul>/g

You may also want the i flag for case-insensitive matching.
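Note that PHP's preg_* functions do not accept a g modifier; preg_match_all() is the function that returns every match. A repeated group such as (<li>...</li>)+ also keeps only its last iteration, so a single pattern will not list each item. One way around both points is the two-step approach chris85 describes in the comments: capture the <ul> inside the wrapper first, then collect the <li> elements from that capture only. The following is a minimal sketch under those assumptions; the variable names and the stray list placed before the wrapper are illustrative, and the patterns assume the tags carry no attributes and the list is not nested.

```php
<?php
// Sample markup based on the question. The first <ul> is an added
// assumption to show that items outside the wrapper are ignored.
$html = <<<'HTML'
<ul><li>Unrelated item</li></ul>
<div class="foo-bar">
    <!-- Other junk -->
    <ul>
        <li>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </li>
        <li>
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        </li>
    </ul>
    <!-- Other junk -->
</div>
HTML;

$items = [];

// Step 1: capture the contents of the first <ul> after the wrapper div.
if (preg_match('~<div class="foo-bar">[\s\S]*?<ul>([\s\S]*?)</ul>~', $html, $list)) {
    // Step 2: collect every <li> inside that captured list only.
    preg_match_all('~<li>([\s\S]*?)</li>~', $list[1], $matches);
    $items = array_map('trim', $matches[1]);
}

print_r($items);
```

Because step 2 runs only on the captured list, other <li> elements elsewhere in the document never reach the second pattern.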

Andy
  • Yes, but other `<li>` exist in the document, thus needing the `<div>` wrapper. – Eamonn Feb 06 '17 at 17:03