Extract multiple from html

Question

I'm trying to extract the words within the <li> </li> tags below. My regex matches, but it only gives me the first <li> (Lorem ipsum...).

I'm reasonably new to regex, and I am aware it would likely be more reliable to do this by traversing the DOM, but in this case regex is preferred. Any ideas what I need to change to get all the results, instead of just the one?

/<div class="foo-bar">[\s\S]+<ul>[\s\S]*?(<li>([\s\S]*?)<\/li>)+[\s\S]*?<\/ul>/

<div class="foo-bar">
    <!-- Other junk -->
    <ul>
        <li>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </li>
        <li>
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        </li>
        <li>
            Fusce neque lacus, feugiat eget sapien eget, ullamcorper rutrum mauris.
        </li>
        <li>
            Maecenas in ipsum consectetur, finibus ex et, condimentum turpis.
        </li>
    </ul>
    <!-- Other junk -->
</div>

Don't use regex. Use a parser. http://php.net/manual/en/domdocument.getelementsbytagname.php To do it with regex you'd need to pull the full `ul` then parse out each `li`. — chris85, Feb 06 '17 at 16:51
Doesn't exist yet, just prototyping the regex. Need to fiddle just a snippet as above. — Eamonn, Feb 06 '17 at 16:51
See: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — TomWilsonFL, Feb 06 '17 at 17:02
@TomWilsonFL "While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML." — Eamonn, Feb 06 '17 at 17:05
I have read it also. :) I still think it is apt for your question because you may be asking a single Regex to do too much. — TomWilsonFL, Feb 06 '17 at 17:14

score 1 · Accepted Answer · answered Feb 07 '17 at 09:23

Use DOM + XPath, not regex.

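// $html holds the markup from the question.
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Select every <li> directly under the <ul> inside div.foo-bar.
foreach ($xpath->evaluate('//div[@class="foo-bar"]/ul/li') as $li) {
  var_dump($li->textContent);
}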

Output:

string(80) "
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        "
string(75) "
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        "
string(95) "
            Fusce neque lacus, feugiat eget sapien eget, ullamcorper rutrum mauris.
        "
string(89) "
            Maecenas in ipsum consectetur, finibus ex et, condimentum turpis.
        "
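The dumped strings above still carry the indentation from the markup. As a minimal sketch of the same DOM + XPath approach with the whitespace trimmed, assuming the markup from the question is available in a variable called `$html` (the helper name `extractListItems` is made up for illustration and is not part of the original answer):

```php
<?php
// Sample markup copied from the question; $html is an assumed variable name.
$html = <<<'HTML'
<div class="foo-bar">
    <!-- Other junk -->
    <ul>
        <li>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </li>
        <li>
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        </li>
        <li>
            Fusce neque lacus, feugiat eget sapien eget, ullamcorper rutrum mauris.
        </li>
        <li>
            Maecenas in ipsum consectetur, finibus ex et, condimentum turpis.
        </li>
    </ul>
    <!-- Other junk -->
</div>
HTML;

// Hypothetical helper: returns the trimmed text of every <li> found
// directly under the <ul> inside <div class="foo-bar">.
function extractListItems(string $html): array
{
    $document = new DOMDocument();

    // Collect libxml parse errors internally instead of emitting PHP warnings,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $items = [];
    foreach ($xpath->query('//div[@class="foo-bar"]/ul/li') as $li) {
        // textContent includes the surrounding newlines and indentation.
        $items[] = trim($li->textContent);
    }

    return $items;
}

$items = extractListItems($html);
print_r($items);
```

Because the XPath expression names the `foo-bar` wrapper, `<li>` elements elsewhere in the document should not be picked up.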

answered Feb 07 '17 at 09:23 by ThW

Great answer otherwise though. – Eamonn Feb 07 '17 at 10:25
I did. You might not WANT to use an XML parser, but it is the much better solution. So I posted the answer more for others that might have the same problem and find your question. – ThW Feb 07 '17 at 12:31

score 0 · Answer 2 · answered Feb 06 '17 at 16:52

Add the global g flag at the end. For example:

/<div class="foo-bar">[\s\S]+<ul>[\s\S]*?(<li>([\s\S]*?)<\/li>)+[\s\S]*?<\/ul>/g

You may also want the i flag for case-insensitive matching.
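For PHP specifically, where the "global" behaviour comes from calling `preg_match_all()` rather than from a flag, a rough sketch like the one below suggests why the original pattern still yields one result. The pattern is the one from the question; the variable names and the shortened sample markup are assumptions for illustration. The whole `<ul>` is consumed by a single overall match, and the `(<li>...<\/li>)+` group cannot repeat because whitespace sits between one `</li>` and the next `<li>`.

```php
<?php
// Shortened sample markup based on the question; $html is an assumed name.
$html = <<<'HTML'
<div class="foo-bar">
    <!-- Other junk -->
    <ul>
        <li>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </li>
        <li>
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        </li>
    </ul>
    <!-- Other junk -->
</div>
HTML;

// The pattern from the question, without any "g" flag.
$pattern = '/<div class="foo-bar">[\s\S]+<ul>[\s\S]*?(<li>([\s\S]*?)<\/li>)+[\s\S]*?<\/ul>/';

// preg_match_all() returns the number of full-pattern matches.
$count = preg_match_all($pattern, $html, $matches);

// Group 2 holds the body of the only <li> the repeated group captured.
$firstItem = trim($matches[2][0]);

echo $count, PHP_EOL;      // prints 1
echo $firstItem, PHP_EOL;  // prints the first item only
```

Even if the items were adjacent, a repeated capture group keeps only its last iteration, so one pattern of this shape would not return every item as a separate capture.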

answered Feb 06 '17 at 16:52 by Andy

There is no `g` in PHP. http://php.net/manual/en/reference.pcre.pattern.modifiers.php The functions are global or not. – chris85 Feb 06 '17 at 16:52
@chris85 instead of `g` you can use the `preg_match_all()` function – funilrys Feb 06 '17 at 16:54
@funilrys Yea, `The functions are global or not.` There still is no `g` modifier though. – chris85 Feb 06 '17 at 16:55
@funilrys Yup `preg_match_all()` still only matches one. :( – Eamonn Feb 06 '17 at 16:56
@Eamonn Yes, this isn't the answer. The regex won't work as you want. – chris85 Feb 06 '17 at 16:56
Sorry, my mistake. If you just want the data between the `<li>`s, could you not just use `\(<li>([\s\S]*?)<\/li>)+\i` – Andy Feb 06 '17 at 17:00

Yes, but other `<li>` elements exist in the document, thus needing the wrapper. – Eamonn Feb 06 '17 at 17:03

score 0 · Answer 3 · answered Feb 06 '17 at 17:03

It'll be better to use the following with preg_match_all(). I just tested it here and it's working.

First, run preg_match_all with the following pattern to get only the content of the `<ul>`:

/<div class="foo-bar">([\s\S]*?)+<ul>([\s\S]*?)<\/ul>([\s\S]*?)<\/div>/

Then run preg_match_all on the result of the first call with the following pattern to get only the <li> contents:

/<li>([\s\S]*?)<\/li>/
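The two steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the function name `extractListItemsTwoStep` is made up, the outer pattern is a simplified variant of the one in this answer (it captures just the `<ul>` body), and `preg_match()` is used for the first step on the assumption that there is a single `foo-bar` wrapper. The extra `<ul>` after the wrapper is added to the sample to check that unrelated `<li>` elements are left out.

```php
<?php
// Sample markup based on the question, plus an unrelated list outside the wrapper.
$html = <<<'HTML'
<div class="foo-bar">
    <!-- Other junk -->
    <ul>
        <li>
            Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        </li>
        <li>
            Vestibulum iaculis nibh ac orci imperdiet ultrices.
        </li>
        <li>
            Fusce neque lacus, feugiat eget sapien eget, ullamcorper rutrum mauris.
        </li>
        <li>
            Maecenas in ipsum consectetur, finibus ex et, condimentum turpis.
        </li>
    </ul>
    <!-- Other junk -->
</div>
<ul>
    <li>Unrelated item</li>
</ul>
HTML;

// Hypothetical helper implementing the two-step idea.
function extractListItemsTwoStep(string $html): array
{
    // Step 1: isolate the body of the first <ul> after the foo-bar wrapper opens.
    $outerPattern = '/<div class="foo-bar">[\s\S]*?<ul>([\s\S]*?)<\/ul>/';
    if (preg_match($outerPattern, $html, $outer) !== 1) {
        return [];
    }

    // Step 2: collect every <li> body, but only from that fragment.
    preg_match_all('/<li>([\s\S]*?)<\/li>/', $outer[1], $inner);

    // Strip the indentation and newlines around each item.
    return array_map('trim', $inner[1]);
}

$items = extractListItemsTwoStep($html);
print_r($items);
```

Splitting the work keeps each pattern simple: the first one only has to find the right container, and the second can match every `<li>` without worrying about what surrounds the list.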

edited Feb 06 '17 at 17:24

answered Feb 06 '17 at 17:03 by funilrys

This is what I need, but it also needs the `<div>` and `<ul>` wrappers to stop it matching other things. – Eamonn Feb 06 '17 at 17:10
@Eamonn This, I believe, is impossible in a single Regex. Break it into two Regexps? – TomWilsonFL Feb 06 '17 at 17:12
Looks like that's going to be the solution. If you want to update your answer to that I'll accept it. – Eamonn Feb 06 '17 at 17:16
@Eamonn Edited my answer, can you test it? – funilrys Feb 06 '17 at 17:24
