I've read a fair amount about this on Stack Overflow and its sister sites, and I understand that using regex to parse HTML isn't best practice. I'm not trying to do any serious or very specific parsing, just grab a few repeating elements from a couple of pages that are very consistent. From those elements, I will then perform other web scraping tasks.
My general question is how to grab whole elements, from the opening tag through the closing tag. (Specifically, in this instance, a set of 'li' elements.)
<li id="result_0" data-asin="<8 char hash>"> ........ </li>
<li id="result_1" data-asin="<8 char hash>"> ........ </li>
<li id="result_2" data-asin="<8 char hash>"> ........ </li>
<li id="result_3" data-asin="<8 char hash>"> ........ </li>
<li id="result_4" data-asin="<8 char hash>"> ........ </li>
....
<li id="result_15" data-asin="<8 char hash>"> ........ </li>
<li id="result_16" data-asin="<8 char hash>"> ........ </li>
<li id="result_17" data-asin="<8 char hash>"> ........ </li>
...
The code I'm using is (PHP):
$pattern = '/[<][l][i]\s[i][d][=]["][a-z]{6}[_][0-9]{1,2}[^li]+/';
$matches = array();
$topics = array();
preg_match_all($pattern, $source, $matches);
var_dump($matches);
and var_dump($matches) outputs:
array (size=1)
0 =>
array (size=28)
0 => string '<li id="result_0" data-as' (length=25)
1 => string '<li id="result_1" data-as' (length=25)
2 => string '<li id="result_2" data-as' (length=25)
3 => string '<li id="result_3" data-as' (length=25)
......
......
I know the match stops at the 'i' in data-asin because of the [^li], which excludes the individual characters 'l' and 'i' rather than the sequence "li", but I'm not sure how to say: accept line breaks and all characters up to "</li>".
Note: there are no other LI elements nested inside each LI element, so nothing should interfere with finding the closing </li>.
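To make the truncation easy to reproduce, here is a minimal sketch that runs my pattern against a single line. The $source string and its data-asin value are made-up stand-ins for the real page, not actual data.

```php
<?php
// Hypothetical one-line sample standing in for the real page source.
$source = '<li id="result_0" data-asin="ABCDEFGH"> some text </li>';

// [^li] is a character class: it rejects any single 'l' or 'i',
// not the two-character sequence "li".
$pattern = '/[<][l][i]\s[i][d][=]["][a-z]{6}[_][0-9]{1,2}[^li]+/';

preg_match($pattern, $source, $m);

// The match ends right before the 'i' in "asin".
var_dump($m[0]);
```

With this sample the match is cut off in the same place as in my real output, at the 'i' in data-asin.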
Also, the
[<][l][i]\s[i][d][=]["]
at the beginning of my pattern looks ugly. Is there a way to write literal text as a group and search for it (e.g. look for "<li id='")? I'm assuming the same approach would let me search for "</li>" as well.
And for the closing </li>, how do I say: match everything UNTIL </li>?
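For reference, this is a sketch of the direction I think I'm after, assuming literal text can simply be written as-is in the pattern and that a lazy quantifier combined with the s modifier is the right way to say "everything until </li>". The sample $source and its data-asin values are invented for illustration, and I haven't confirmed this holds up on the real pages.

```php
<?php
// Hypothetical two-item sample; the data-asin values are made up.
$source = "<li id=\"result_0\" data-asin=\"AAAAAAAA\">\n  first item\n</li>\n"
        . "<li id=\"result_1\" data-asin=\"BBBBBBBB\">\n  second item\n</li>\n";

// Literal text is written directly. '#' is used as the delimiter so the '/'
// in '</li>' needs no escaping.
// .*? is lazy, so it stops at the first '</li>' it can reach;
// the s modifier lets '.' match line breaks as well.
$pattern = '#<li id="result_\d{1,2}".*?</li>#s';

$count = preg_match_all($pattern, $source, $matches);

var_dump($count);
var_dump($matches[0]);
```

This relies on there being no nested LI elements, since the lazy match ends at the first </li> after each opening tag.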