Grabbing elements from HTML using Regex

Question

I've read up on this a good amount around Stack Overflow and its sister sites, and I understand that using regex to parse HTML isn't best practice. I'm not trying to do any serious or very specific parsing, just grab a few repeating elements on a couple of pages that are very consistent. From those elements, I will then perform other web scraping tasks.

My general question is how to grab whole elements, from the opening tag through the closing tag (in this instance, a set of 'li' elements):

<li id="result_0" data-asin="<8 char hash>"> ........ </li>
<li id="result_1" data-asin="<8 char hash>"> ........ </li>
<li id="result_2" data-asin="<8 char hash>"> ........ </li>
<li id="result_3" data-asin="<8 char hash>"> ........ </li>
<li id="result_4" data-asin="<8 char hash>"> ........ </li>
....
<li id="result_15" data-asin="<8 char hash>"> ........ </li>
<li id="result_16" data-asin="<8 char hash>"> ........ </li>
<li id="result_17" data-asin="<8 char hash>"> ........ </li>
...

The code I'm using is (PHP):

$pattern = '/[<][l][i]\s[i][d][=]["][a-z]{6}[_][0-9]{1,2}[^li]+/';
$matches = array();
$topics = array();
preg_match_all($pattern, $source, $matches);
var_dump($matches);

and $matches returns

array (size=1)
    0 => 
        array (size=28)
              0 => string '<li id="result_0" data-as' (length=25)
              1 => string '<li id="result_1" data-as' (length=25)
              2 => string '<li id="result_2" data-as' (length=25)
              3 => string '<li id="result_3" data-as' (length=25)
 ......
 ......

I know I'm stopping at the 'i' in data-asin because of the [^li] but I'm not sure how to say: accept line breaks and all characters except for "</li>"

Note: There are no other LI elements nested inside these LI elements to interfere with finding the closing tag.

Also the:

[<][l][i]\s[i][d][=]["]

at the beginning of my pattern looks like trash. Is there a way to group literal text and search for it (e.g. look for "<li id='")? I'm assuming this would let me search for my "</li>" as well.

And for the closing </li>, how do I say: match everything UNTIL </li>?

This is REQUIRED READING for this kind of question: http://stackoverflow.com/a/1732454/18157 — Jim Garrison, Jan 29 '16 at 22:22

score 2 · Answer 1 · answered Jan 29 '16 at 22:32

You'd really be better off using a parser and some XPath queries instead; e.g. to grab all your list items you'd only need two lines:

$xml = simplexml_load_file($url);
$items = $xml->xpath("//li[starts-with(@id, 'result_')]");
foreach ($items as $item) {
    // do sth. with the item
}

This matters especially when your data-asin attributes contain < and >.
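
SimpleXML expects well-formed XML, so it may fail on scraped HTML. The following is a minimal sketch of the same XPath query run through DOMDocument::loadHTML, which tolerates malformed markup. The sample markup and variable names are hypothetical stand-ins for the real page.

```php
<?php
// Hypothetical sample markup standing in for the real page source.
$html = <<<'HTML'
<ul>
  <li id="result_0" data-asin="B00AAAA1">First <b>item</b></li>
  <li id="other_1">Not a result</li>
  <li id="result_1" data-asin="B00AAAA2">Second
      item</li>
</ul>
HTML;

// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely well-formed.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Same XPath expression as above, evaluated against the parsed HTML.
$xpath = new DOMXPath($dom);
$items = $xpath->query("//li[starts-with(@id, 'result_')]");

$ids = [];
foreach ($items as $item) {
    $ids[] = $item->getAttribute('id');
}

print_r($ids);
```

Nested tags and line breaks inside an li do not matter here, because the parser has already resolved where each element ends.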

score 1 · Accepted Answer · answered Jan 29 '16 at 22:26

I'll preface this by saying that I'm not familiar with PHP, but regular expressions are generally the same or similar across languages.

Simplified Pattern: /<li id="result_\d+" data-asin=".{8}">[^<]+<\/li>/

This could be simplified further if you just want to blindly grab all li tags regardless of id or data-asin attributes.
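
As a rough illustration, here is how that pattern might be used from PHP with preg_match_all. The sample input and its data-asin values are made up. Note one limitation the sketch makes visible: `[^<]+` crosses line breaks but stops at the first `<`, so an li that contains other tags is skipped.

```php
<?php
// Hypothetical sample; the 8-character data-asin values are made up.
$source = '<li id="result_0" data-asin="B00AAAA1"> plain text </li>' . "\n"
        . '<li id="result_1" data-asin="B00AAAA2"> text with <b>markup</b> </li>' . "\n"
        . '<li id="result_2" data-asin="B00AAAA3"> two' . "\n" . 'lines </li>';

// The simplified pattern from above, unchanged.
$pattern = '/<li id="result_\d+" data-asin=".{8}">[^<]+<\/li>/';
preg_match_all($pattern, $source, $matches);

// $matches[0] holds the full matches: result_0 and result_2 only,
// because result_1 contains a nested <b> tag.
var_dump($matches[0]);
```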

michael

– domdambrogia Jan 29 '16 at 23:12
That regex would match any ID that starts with `result_` and ends with digits (0-9). If the ID was something like `result_34test`, then that regex would NOT match that `li` tag. The regex above uses literal characters for most of the match; only the `\d+`, `.{8}`, and `[^<]+` are dynamically evaluated, and everything else is taken for its literal value. The forward slash in the closing `li` tag is escaped with a preceding backslash. Hopefully I understood your comment correctly; if not, please feel free to explain further or ask. – michael Jan 30 '16 at 00:41

score 1 · Answer 3 · answered Jan 29 '16 at 22:38

Use a regex of this sort:

<(li|ol|otherelement)[\s\S]+?<\/(\1)>

In the first group you can list all the elements you want the regex to find; the \1 backreference then makes sure the closing tag has the same element name. [\s\S]+? matches one or more of any character, including newlines, and the trailing ? makes it lazy, so the match ends at the first possible closing tag of that element type.
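
A minimal PHP sketch of this lazy match with a backreference follows. The sample markup is hypothetical, and the `\b` after the element name is an addition not in the pattern above; it is there so that `<li` does not also match tags such as `<link`.

```php
<?php
// Hypothetical sample markup with a line break and a nested tag.
$source = <<<'HTML'
<ul>
<li id="result_0">first
item</li>
<li id="result_1">second <b>item</b></li>
</ul>
HTML;

// Group 1 captures the element name; \1 requires the same name in the
// closing tag. [\s\S]+? lazily consumes any characters, newlines included.
$pattern = '~<(li|ol)\b[\s\S]+?</\1>~';
preg_match_all($pattern, $source, $matches);

// $matches[0]: full elements, $matches[1]: the captured element names.
print_r($matches[0]);
```

Because the quantifier is lazy, each match ends at the first closing tag of the same name, which is only safe when elements of that type are not nested inside each other.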

score 1 · Answer 4 · answered Jan 29 '16 at 22:44

<li id="result_0" data-asin="<8 char hash>"> ........ </li>

~\Q<li id="\E([^"]*)\Q" data-asin="\E([a-zA-Z]{8})\Q">\E(.*)\Q</li>\E~

https://regex101.com/r/lI0zR5/1
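
In PCRE, everything between \Q and \E is taken literally, which addresses the question about grouping literal text. Below is a sketch of how this pattern might be applied in PHP. Two things differ from the pattern above and are assumptions on my part: the content group is lazy, `(.*?)`, and the `s` modifier is set so that `.` also matches line breaks. The sample input is hypothetical.

```php
<?php
// Hypothetical sample; the second item spans two lines.
$source = '<li id="result_0" data-asin="ABCDEFGH"> first </li>' . "\n"
        . '<li id="result_1" data-asin="IJKLMNOP"> second' . "\n" . 'line </li>';

// \Q...\E marks literal text. Groups: 1 = id, 2 = data-asin, 3 = content.
$pattern = '~\Q<li id="\E([^"]*)\Q" data-asin="\E([a-zA-Z]{8})\Q">\E(.*?)\Q</li>\E~s';
preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' / ', $m[2], ' / ', trim($m[3]), PHP_EOL;
}

// preg_quote() is another way to embed literal text in a pattern;
// the second argument names the delimiter to escape as well.
$prefix = preg_quote('<li id="', '~');
$count = preg_match_all('~' . $prefix . '~', $source);
```

With a greedy `(.*)` and no `s` modifier, the content group cannot cross a line break, and on a single line it would run to the last `</li>` rather than the nearest one.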

hakre

score 0 · Answer 5 · answered Jan 29 '16 at 22:27

An easier pattern

(?<=li id\=).*(?=\<\/li\>)
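
To see how this lookaround pattern behaves, here is a small PHP sketch on a hypothetical one-line input. It compares the pattern as written with a lazy variant; the lazy quantifier and the `s` modifier are additions for illustration, not part of the pattern above.

```php
<?php
// Hypothetical input with two li elements on one line.
$oneLine = '<li id="a">x</li><li id="b">y</li>';

// Pattern as written: the greedy .* runs to the LAST </li> on the line.
preg_match_all('~(?<=li id\=).*(?=\<\/li\>)~', $oneLine, $greedy);

// Lazy variant: stops at the nearest </li>; "s" lets "." cross newlines.
preg_match_all('~(?<=li id\=).*?(?=\<\/li\>)~s', $oneLine, $lazy);

print_r($greedy[0]); // one match spanning both elements
print_r($lazy[0]);   // one match per element
```

In both cases the match starts after `li id=` and ends before `</li>`, so the tags themselves are not part of the result.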

Nefariis

score 0 · Answer 6 · Casimir et Hippolyte · 2016-01-29T22:43:00.973

The best advice I can give you is to read a regex tutorial to understand what is wrong with your regex approach. That said, searching HTML as plain text with regex isn't a good way to get what you want. Use the HTML structure instead:

$dom = new DOMDocument;
$dom->loadHTML($html);

$lis = $dom->getElementsByTagName('li');

foreach($lis as $li) {
    if (preg_match('/^[a-z]{6}_[0-9]{1,2}$/', $li->getAttribute('id')))
        echo $dom->saveHTML($li) . PHP_EOL;
}

edited Jan 29 '16 at 22:43

answered Jan 29 '16 at 22:34


I usually really love your regex solutions (and I'm learning *a lot*), however in this particular situation, xpath queries seem more appropriate, don't you think? – Jan Jan 29 '16 at 22:35
@Jan: since XPath 1.0 doesn't support regex, and since `starts-with(., 'result_')` is too restrictive and doesn't check the digits, I have chosen to use preg_match to stay more general. Obviously you can use the `DOMXPath::registerPhpFunctions` method to obtain a more precise result, but I prefer a not too complicated answer. – Casimir et Hippolyte Jan 29 '16 at 22:38
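
For reference, a sketch of the XPath route mentioned in the comment above, where preg_match is exposed to XPath so the digit check happens inside the query. The sample markup and the exact id pattern are assumptions for illustration.

```php
<?php
// Hypothetical sample: only the first id is "result_" followed by digits.
$html = '<ul><li id="result_3">keep</li>'
      . '<li id="result_34test">skip</li>'
      . '<li id="other_1">skip</li></ul>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// The "php" prefix must be bound to this namespace before PHP
// functions can be called from an XPath expression.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match');

// php:functionString passes @id to preg_match as a string.
$query = '//li[php:functionString("preg_match", "/^result_[0-9]+$/", @id) = 1]';

$ids = [];
foreach ($xpath->query($query) as $li) {
    $ids[] = $li->getAttribute('id');
}
print_r($ids);
```

The explicit `= 1` comparison is deliberate: it avoids relying on how the integer returned by preg_match is converted to an XPath boolean.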
