0

I've read a fair amount about this on Stack Overflow and its sister sites, and I understand that using regex to parse HTML isn't best practice. I'm not trying to do any serious or very specific parsing, just grab a few repeating elements from a couple of pages whose markup is very consistent. I'll then perform other web scraping tasks on those elements.

My general question is how to grab whole elements, from the opening tag through the closing tag (in this instance, a set of 'li' elements):

<li id="result_0" data-asin="<8 char hash>"> ........ </li>
<li id="result_1" data-asin="<8 char hash>"> ........ </li>
<li id="result_2" data-asin="<8 char hash>"> ........ </li>
<li id="result_3" data-asin="<8 char hash>"> ........ </li>
<li id="result_4" data-asin="<8 char hash>"> ........ </li>
....
<li id="result_15" data-asin="<8 char hash>"> ........ </li>
<li id="result_16" data-asin="<8 char hash>"> ........ </li>
<li id="result_17" data-asin="<8 char hash>"> ........ </li>
...

The code I'm using is (PHP):

$pattern = '/[<][l][i]\s[i][d][=]["][a-z]{6}[_][0-9]{1,2}[^li]+/';
$matches = array();
$topics = array();
preg_match_all($pattern, $source, $matches);
var_dump($matches);

and $matches returns

array (size=1)
    0 => 
        array (size=28)
              0 => string '<li id="result_0" data-as' (length=25)
              1 => string '<li id="result_1" data-as' (length=25)
              2 => string '<li id="result_2" data-as' (length=25)
              3 => string '<li id="result_3" data-as' (length=25)
 ......
 ......

I know I'm stopping at the 'i' in data-asin because of the [^li] but I'm not sure how to say: accept line breaks and all characters except for "</li>"

Note: no LI element contains another LI element, so nothing nested can throw off the search for a closing LI tag.

Also the:

[<][l][i]\s[i][d][=]["]

beginning of my pattern looks like trash. Is there a way to group literal text and search for it (e.g. look for "<li id='")? I'm assuming this would also help me search for my "</li>".

And for the closing </li>, how do I say: match everything UNTIL </li>?

halfer
domdambrogia

6 Answers

2

You'd really be better off using a parser and some XPath queries instead. For example, grabbing all your list items takes only two lines:

// Note: simplexml_load_file() expects well-formed XML and will fail on HTML that is not.
$xml = simplexml_load_file($url);
$items = $xml->xpath("//li[starts-with(@id, 'result_')]");
foreach ($items as $item) {
    // do something with the item
}

Especially when your data-asin attributes contain < and >.
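
Real-world HTML is often not well-formed XML, so here is a minimal sketch of the same XPath query run through DOMDocument::loadHTML() and DOMXPath instead. The sample markup and the data-asin values are made up for illustration; they are not from the question.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = <<<HTML
<ul>
<li id="result_0" data-asin="B00AAAAA">first <b>item</b></li>
<li id="other" data-asin="B00BBBBB">ignored</li>
<li id="result_1" data-asin="B00CCCCC">second
item</li>
</ul>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);              // tolerant HTML parser, unlike the XML loader
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Same XPath as above: every <li> whose id starts with "result_".
$items = $xpath->query("//li[starts-with(@id, 'result_')]");

$ids = [];
foreach ($items as $item) {
    $ids[] = $item->getAttribute('id');
}
print_r($ids);
```

Inner markup and line breaks inside an item need no special handling here, since the parser deals with them.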

Jan
1

I'll preface this by saying that I'm not familiar with PHP, but regular expressions are generally the same or similar across languages.

Simplified Pattern: /<li id="result_\d+" data-asin=".{8}">[^<]+<\/li>/

This could be simplified further if you just want to blindly grab all li tags regardless of id or data-asin attributes.
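
As a rough PHP sketch of how that pattern behaves (the sample markup is invented): `[^<]+` stops at the first `<`, so an item that contains inner tags is skipped. The second pattern is my own variation, not part of the answer. It swaps in a lazy `.*?` with the `s` modifier so the match runs up to the first `</li>`, across line breaks if needed.

```php
<?php
// Hypothetical sample; the 8-character data-asin values are placeholders.
$source = <<<HTML
<li id="result_0" data-asin="B00AAAAA">plain text</li>
<li id="result_34test" data-asin="B00BBBBB">id has trailing letters</li>
<li id="result_1" data-asin="B00CCCCC">has <b>markup</b> inside</li>
HTML;

// The simplified pattern from this answer, unchanged.
$strict = '/<li id="result_\d+" data-asin=".{8}">[^<]+<\/li>/';
preg_match_all($strict, $source, $strictMatches);

// Variation (assumption, not from the answer): lazy .*? plus the s modifier,
// so "." also matches newlines and the match ends at the first </li>.
$lazy = '/<li id="result_\d+" data-asin=".{8}">.*?<\/li>/s';
preg_match_all($lazy, $source, $lazyMatches);

echo count($strictMatches[0]), " strict match(es)\n";
echo count($lazyMatches[0]), " lazy match(es)\n";
```

Neither pattern accepts `result_34test`, because `\d+` must be followed directly by the closing quote.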

michael
  • That regex would match any ID that started with `result_` and ended with digits (0-9). If the ID was something like `result_34test`, then that regex would NOT match that `li` tag. The regex above uses literal characters for most of the match, only the `\d+`, `.{8}`, and `[^<]+` are dynamically evaluated, everything else is taken for its literal value. The forward slash in the closing `li` tag is escaped with a preceding back slash. Hopefully I understood your comment correctly, if not, please feel free to further explain / ask. – michael Jan 30 '16 at 00:41
1

A regex of this sort:

<(li|ol|otherelement)[\s\S]+?<\/(\1)>

In the first () you can list all the elements you want the regex to find, and the (\1) backreference makes sure the closing tag matches the element that was opened. [\s\S]+? matches one or more of any character, including newlines; the trailing ? makes it lazy, so the match stops at the first possible closing tag of that element type.
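
A minimal PHP sketch of that idea, assuming made-up sample markup. The `~` delimiter and the `\b` after the group are my additions: the first avoids escaping the slash, the second keeps `li` from also matching the start of a tag such as `<link>`.

```php
<?php
// Hypothetical sample: the first item spans two lines.
$source = <<<HTML
<li id="result_0" data-asin="B00AAAAA">first
item</li>
<li id="result_1" data-asin="B00BBBBB">second item</li>
HTML;

// Group 1 captures the tag name; \1 requires the same name in the closing tag.
// [\s\S]+? lazily matches any characters, newlines included.
$pattern = '~<(li|ol)\b[\s\S]+?</\1>~';
preg_match_all($pattern, $source, $matches);

// $matches[0] holds the full elements, $matches[1] the captured tag names.
foreach ($matches[0] as $i => $element) {
    echo $matches[1][$i], ': ', strlen($element), " bytes\n";
}
```

This relies on the question's guarantee that no `li` is nested inside another `li`; with nesting, the lazy match would end at the inner closing tag.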

Luchiro
1
<li id="result_0" data-asin="<8 char hash>"> ........ </li>

~\Q<li id="\E([^"]*)\Q" data-asin="\E([a-zA-Z]{8})\Q">\E(.*)\Q</li>\E~

https://regex101.com/r/lI0zR5/1
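
For reference, a small PHP sketch of that pattern in use. Everything between `\Q` and `\E` is taken literally, which is one way to group literal text without wrapping each character in brackets. The sample line and its letters-only data-asin value are assumptions made to fit `[a-zA-Z]{8}`.

```php
<?php
// Hypothetical single list item; the data-asin value is a placeholder.
$line = '<li id="result_0" data-asin="ABCDEFGH"> some text </li>';

// \Q...\E quotes literal text; the three groups capture id, data-asin and content.
$pattern = '~\Q<li id="\E([^"]*)\Q" data-asin="\E([a-zA-Z]{8})\Q">\E(.*)\Q</li>\E~';

if (preg_match($pattern, $line, $m)) {
    echo "id:      {$m[1]}\n";
    echo "asin:    {$m[2]}\n";
    echo "content: {$m[3]}\n";
}
```

When the literal text is only known at runtime, PHP's preg_quote() does comparable escaping. Note that `(.*)` here is greedy and does not cross line breaks unless the `s` modifier is added.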

hakre
0

An easier pattern

(?<=li id\=).*(?=\<\/li\>)
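
A quick PHP sketch of how this lookaround pattern behaves, using an invented one-line sample. Because `.*` is greedy, two items on the same line come back as a single match running to the last `</li>`; the second call is my own variation with a lazy quantifier and the `s` modifier. In both cases the match starts after `li id=`, so the opening `<li id=` itself is not part of the result.

```php
<?php
// Hypothetical sample: two list items on one line.
$source = '<li id="result_0">a</li><li id="result_1">b</li>';

// Pattern from the answer: greedy .* runs to the LAST </li> on the line.
preg_match_all('/(?<=li id\=).*(?=\<\/li\>)/', $source, $greedy);

// Variation (assumption): lazy .*? stops before the first </li>;
// the s modifier lets "." cross line breaks as well.
preg_match_all('/(?<=li id=).*?(?=<\/li>)/s', $source, $lazy);

print_r($greedy[0]);
print_r($lazy[0]);
```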
Nefariis
0

The best advice I can give you is to read a regex tutorial to understand what is wrong with your regex. Beyond that, searching HTML as plain text with a regex isn't a good way to get what you want. Use the HTML structure instead:

$dom = new DOMDocument;
$dom->loadHTML($html);

$lis = $dom->getElementsByTagName('li');

foreach($lis as $li) {
    if (preg_match('/^[a-z]{6}_[0-9]{1,2}$/', $li->getAttribute('id')))
        echo $dom->saveHTML($li) . PHP_EOL;
}
Casimir et Hippolyte
  • I usually really love your regex solutions (and I'm learning *a lot*), however in this particular situation, xpath queries seem more appropriate, don't you think? – Jan Jan 29 '16 at 22:35
  • @Jan: since XPath 1.0 doesn't support regex, and since `starts-with(., 'result_')` is too permissive and doesn't check the digits, I chose to use preg_match to stay more general. Obviously you can use the `DOMXPath::registerPhpFunctions` method to obtain a more precise result, but I prefer a not too complicated answer. – Casimir et Hippolyte Jan 29 '16 at 22:38
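
For anyone curious about the approach mentioned in that last comment, here is a rough sketch, assuming invented sample markup, of exposing preg_match to XPath through DOMXPath::registerPhpFunctions(). The `> 0` comparison is a cautious choice on my part so that a non-matching result is treated as false however the return value is converted.

```php
<?php
// Hypothetical sample markup; a real page would come from the scraper.
$html = <<<HTML
<ul>
<li id="result_0" data-asin="B00AAAAA">first</li>
<li id="result_34test" data-asin="B00BBBBB">rejected by the regex</li>
<li id="result_17" data-asin="B00CCCCC">second</li>
</ul>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Make PHP functions callable from XPath under the "php" prefix.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('preg_match');

// functionString passes the string value of @id to preg_match.
$query = '//li[php:functionString("preg_match", "/^[a-z]{6}_[0-9]{1,2}$/", @id) > 0]';

$matched = [];
foreach ($xpath->query($query) as $li) {
    $matched[] = $li->getAttribute('id');
}
print_r($matched);
```

This keeps the filtering inside the query, at the cost of a slightly more involved setup than the plain foreach with preg_match.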