-1

I need a regular expression in PHP that matches any characters, including whitespace, without capturing them into a variable.

<li class="xyz" data-name="abc">
    <span id="XXX">some words</span>
    <div data-attribute="values">
        <a class="klm" href="http://example.com/blabla">somethings</a>
    </div>
    <div class="xyz sub" data-name="abc-sub"><a href="http://www.example.com/blabla/images"><img src="/images/any_image.jpg" class="qqwwee"></a></div>
</li><!--repeating li tags-->

I wrote this pattern:

preg_match_all('#<li((?s).*?)<div((?s).*?)href="((?s).*?)"((?s).*?)</li>#', $subject, $matches);

This works, but I don't want four captured groups. I only want to get:

http://example.com/blabla

Can anyone tell me why the following does not work the same way?

preg_match_all('#<li[[?s].*?]<div[[?s].*?]href="((?s).*?)"[[?s].*?]</li>#', $subject, $matches);

2 Answers

1

Using (?:...) lets you group part of a pattern without capturing it. For example, this pattern:

#<li(?:(?s).*?)<div(?:(?s).*?)href="((?s).*?)"(?:(?s).*?)</li>#

fills $matches like this:

array (
  0 => 
  array (
    0 => '<li class="xyz" data-name="abc">
    <span id="XXX">some words</span>
    <div data-attribute="values">
        <a class="klm" href="http://example.com/blabla">somethings</a>
    </div>
    <div class="xyz sub" data-name="abc-sub"><a href="http://www.example.com/blabla/images"><img src="/images/any_image.jpg" class="qqwwee"></a></div>
</li>',
  ),
  1 => 
  array (
    0 => 'http://example.com/blabla',
  ),
)

All of your matches will be contained in $matches[1], so iterate through that.
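
As a minimal sketch of that iteration, the snippet below runs the non-capturing pattern against a sample string. The second <li> item and the variable name $urls are assumptions added for illustration; they are not part of the original question.

```php
<?php
// Sample input modelled on the question's markup.
// The second <li> is invented here to show more than one match.
$subject = <<<'HTML'
<li class="xyz" data-name="abc">
    <span id="XXX">some words</span>
    <div data-attribute="values">
        <a class="klm" href="http://example.com/blabla">somethings</a>
    </div>
</li>
<li class="xyz" data-name="def">
    <div data-attribute="values">
        <a class="klm" href="http://example.com/other">more</a>
    </div>
</li>
HTML;

// (?: ... ) groups without capturing, so only the href value ends up in group 1.
$pattern = '#<li(?:(?s).*?)<div(?:(?s).*?)href="((?s).*?)"(?:(?s).*?)</li>#';

preg_match_all($pattern, $subject, $matches);

// $matches[0] holds the full matches, $matches[1] the captured URLs.
$urls = $matches[1];
foreach ($urls as $url) {
    echo $url, PHP_EOL;
}
```

Each <li> contributes one entry to $matches[1]: the first href that follows the first <div inside it.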

0

Don't use RegExps to parse HTML

Read this famous answer on StackOverflow.

HTML is not a regular language, so it cannot be reliably processed with a RegExp. Instead, use a proper (and robust) HTML parser.

Also note that data mining (analysis) is not the same thing as data collection.

If you don't want a regexp group to store the "captured" data, use a non-capturing group:

(?:some-complex-regexp-here)

In your case, the following may work:

(?s)<li.*?<div.*?href="([^"]*?)".*?</li>

But seriously, don't use regexps for this; regexps are fragile. Use an XPath expression like //li//div//a/@href instead.
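
A minimal sketch of the parser-based approach, assuming PHP's bundled DOM extension (DOMDocument and DOMXPath) is available. The XPath is written here as //li//div//a/@href so that it matches <li> elements anywhere in the document; the wrapping <ul> and the variable names are assumptions for illustration.

```php
<?php
// Markup from the question, wrapped in a <ul> (assumed) so it is a complete list.
$html = <<<'HTML'
<ul>
<li class="xyz" data-name="abc">
    <span id="XXX">some words</span>
    <div data-attribute="values">
        <a class="klm" href="http://example.com/blabla">somethings</a>
    </div>
    <div class="xyz sub" data-name="abc-sub"><a href="http://www.example.com/blabla/images"><img src="/images/any_image.jpg" class="qqwwee"></a></div>
</li>
</ul>
HTML;

// Collect libxml warnings internally instead of printing them,
// since real-world HTML is rarely perfectly formed.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the href attribute of every <a> inside a <div> inside an <li>.
$hrefs = [];
foreach ($xpath->query('//li//div//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

This returns both links in the item. To keep only the link in the first <div> of each <li>, a narrower expression such as //li/div[1]//a/@href should work under the same assumptions.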
