I have the following code that I'm trying to get fixed.

The code:

$pageData = file_get_contents('111234-2.html');
if(preg_match_all('/<a\s+onclick=["\']([^"\']+)["\']/i', $pageData, $links, PREG_PATTERN_ORDER))
     print_r(array_unique($links[1]));
return false;

Some sample HTML that I want to extract the onclick value from:

    <a onclick="doShowCHys=1;ShowWindowN(0,'http://www.example.com/home/Player.aspx?lpk4=116031&amp;playChapter=False',960,540,111234);return false;" href="javascript:void(0);">
<span class="vt">Welcome

        </span>
        <span class="dur">1m 10s</span>
        <span class="" id="bkmimgview-116031">&nbsp;</span>
        <br class="clear">
    </a>

The output I am getting:

Array ( [0] => doShowCHys=1;ShowWindowN(0, )

The output I am hoping for:

Array ( [0] => doShowCHys=1;ShowWindowN(0,'http://www.example.com/home/Player.aspx?lpk4=116031&amp;playChapter=False',960,540,111234);return false;)

How do I achieve this?

Andy Lester
Ryan

1 Answer

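Your character class [^"\']+ stops at the first quote of either kind, so the match ends at the ' that opens the URL. You can improve this using a backreference, but you're pretty much doomed if there are any more levels of nested quotes.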

'/<a\s+onclick=(["\'])((?:(?!\1).)+)\1/i'

The backreference lets you refer to an already-captured group. So, if you caught a " in the first capture, you want a string of non-" characters, and if you caught a ' in the first capture, you want a string of non-' characters; either way the match must end with that same quote. Note that the quote is now group 1, so the onclick value moves to group 2 ($links[2] instead of $links[1]).
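As a rough sketch of how the backreference pattern could be dropped into the code from the question: the HTML here is just the opening tag from the sample, pasted inline as a string instead of being read from 111234-2.html, which is an assumption made so the snippet runs on its own.

```php
<?php
// Opening <a> tag from the question's sample, inlined for a self-contained run
// (the original code reads it from a file with file_get_contents()).
$pageData = <<<'HTML'
<a onclick="doShowCHys=1;ShowWindowN(0,'http://www.example.com/home/Player.aspx?lpk4=116031&amp;playChapter=False',960,540,111234);return false;" href="javascript:void(0);">
HTML;

// Group 1 captures the opening quote; group 2 captures everything up to
// the next occurrence of that same quote.
$pattern = '/<a\s+onclick=(["\'])((?:(?!\1).)+)\1/i';

if (preg_match_all($pattern, $pageData, $links, PREG_PATTERN_ORDER)) {
    // The attribute value is in $links[2]; $links[1] only holds the quote.
    print_r(array_unique($links[2]));
}
```

With the sample tag, $links[2][0] should hold the whole onclick value, including the single-quoted URL.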

EDIT:

@vladr offers a much nicer alternative:

'/<a\s+onclick=(["\'])(.*?)\1/i'

Same idea but the non-greedy quantifier makes it unnecessary to test every character for non-whatever-quote-ness. Updated Rubular link: http://rubular.com/r/VXR1nQ4zf5.
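To check that the two patterns behave the same way, here is a small comparison. The two tags and their URLs are made up for illustration (one double-quoted attribute, one single-quoted); they are not from the question.

```php
<?php
// Hypothetical test tags: one attribute per quote style.
$samples = [
    '<a onclick="ShowWindowN(0,\'http://www.example.com/a\');return false;" href="#">',
    "<a onclick='ShowWindowN(0,\"http://www.example.com/b\");return false;' href=\"#\">",
];

$tempered = '/<a\s+onclick=(["\'])((?:(?!\1).)+)\1/i'; // lookahead on every character
$lazy     = '/<a\s+onclick=(["\'])(.*?)\1/i';          // non-greedy quantifier

$results = [];
foreach ($samples as $html) {
    preg_match($tempered, $html, $a);
    preg_match($lazy, $html, $b);
    // Both patterns stop at the first quote that matches the opening one.
    $results[] = ['value' => $b[2], 'same' => ($a[2] === $b[2])];
    echo $b[2], ($a[2] === $b[2] ? "  (both agree)" : "  (patterns differ)"), "\n";
}
```

One small difference to keep in mind: (.*?) also matches an empty attribute such as onclick="", while the + in the first pattern requires at least one character.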

Andrew Cheong