Greetings All

I am trying to get the values in the 4th column from the left on this URL. I can get all the values, but my regex skips the first one (e.g. 30, which I think is the value at the top right now).

My regex is

~<td align="center" class="row2">.*<a href="javascript:who_posted.*;">([\d,]+)</a>.*</td>~isU

NOTE: HTML PARSING IS NOT AN OPTION RIGHT NOW AS THIS IS PART OF A HUGE SYSTEM AND CANNOT BE CHANGED

Thanking you Imran

Imran Omar Bukhsh
  • I would use a proper HTML parser. See http://stackoverflow.com/questions/3577641/best-methods-to-parse-html – Pekka May 01 '11 at 09:46
  • @Pekka - Thank you for posting *this* link, and not *that* link. – Kobi May 01 '11 at 10:01
  • @Kobi yeah. *that* link is a legend, but it's not really that productive. If it weren't so sacrilegious, I'd add a collection of links to it – Pekka May 01 '11 at 10:02

1 Answer

You could just use:

~<a href="javascript:who_posted\(\d+\);?">([\d,]+)</a>~

The JavaScript call can serve as the "selection point" for the regex, so there is no need to match the surrounding <td> at all.
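
As a rough sketch of that idea, the snippet below runs the shorter pattern against a made-up HTML fragment. The `$html` sample and its numbers are hypothetical, not the real forum page. It uses `~` as the delimiter so the `/` in `</a>` needs no escaping.

```php
<?php
// Hypothetical sample rows; the real page markup may differ.
$html = <<<'HTML'
<td align='center' class="row2"><a href="javascript:who_posted(101);">30</a></td>
<td align="center" class="row2"><a href="javascript:who_posted(102);">1,216</a></td>
HTML;

// Anchor on the who_posted(...) link only; the surrounding <td> is ignored,
// so the quoting of its attributes does not matter.
preg_match_all('~<a href="javascript:who_posted\(\d+\);?">([\d,]+)</a>~', $html, $m);

// $m[1] holds the captured numbers as strings.
print_r($m[1]);
```

Because the pattern never looks at the `<td>` attributes, both rows are captured regardless of how `align` is quoted.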


If you want your own regex to work, make the quantifiers non-greedy explicitly, i.e. change .* to .*? and drop the U modifier (with U set, .*? becomes greedy again).
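
A small sketch of how greediness and the `U` modifier interact, using a hypothetical two-cell string rather than the real page:

```php
<?php
// Hypothetical one-line sample with two cells.
$html = '<td>1</td><td>2</td>';

// Greedy: .* runs to the last </td>, so there is a single match.
preg_match_all('~<td>.*</td>~s', $html, $greedy);

// Lazy: .*? stops at the first </td>, giving one match per cell.
preg_match_all('~<td>.*?</td>~s', $html, $lazy);

// With U, a plain .* is already lazy.
preg_match_all('~<td>.*</td>~sU', $html, $ungreedy);

// With U, .*? flips back to greedy.
preg_match_all('~<td>.*?</td>~sU', $html, $flipped);

// prints "1 2 2 1"
echo count($greedy[0]), ' ', count($lazy[0]), ' ', count($ungreedy[0]), ' ', count($flipped[0]), "\n";
```

So it is safest to pick one style, either `U` with plain quantifiers or no `U` with `?` suffixes, and not mix the two.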

Also, in the first row of the HTML the align attribute is wrapped in single quotes ('') rather than double quotes (""), for some weird inconsistent reason. Try this:

   |<td align=["\']center["\'] class="row2">.*?<a href="javascript:who_posted[^"]+">([\d,]+)</a>.*?</td>|is
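
To see why the quote style matters, here is a minimal sketch comparing a double-quotes-only pattern with the tolerant one. The `$html` fragment, `$strict`, and `$tolerant` are hypothetical names and data made up for illustration; the real page is assumed to look roughly like this.

```php
<?php
// Hypothetical rows: the first quotes align with ', the second with ".
$html = <<<'HTML'
<td align='center' class="row2">
  <a href="javascript:who_posted(1);">30</a>
</td>
<td align="center" class="row2">
  <a href="javascript:who_posted(2);">16</a>
</td>
HTML;

// Strict pattern: accepts only double quotes around center.
$strict = '|<td align="center" class="row2">.*?<a href="javascript:who_posted[^"]+">([\d,]+)</a>.*?</td>|is';

// Tolerant pattern: accepts either quote style around center.
$tolerant = '|<td align=["\']center["\'] class="row2">.*?<a href="javascript:who_posted[^"]+">([\d,]+)</a>.*?</td>|is';

preg_match_all($strict, $html, $s);
preg_match_all($tolerant, $html, $t);

print_r($s[1]); // only the double-quoted row is captured
print_r($t[1]); // both rows are captured
```

The strict pattern silently skips the single-quoted row, which is the same symptom as the missing first value.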

Edit:

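// Fetch the raw page source (requires allow_url_fopen to be enabled)
$a = file_get_contents('http://www.zajilnet.com/forum/index.php?showforum=31');

// Capture the number inside each who_posted link; accept either quote style around center
preg_match_all('|<td align=["\']center["\'] class="row2">.*?<a href="javascript:who_posted[^"]+">([\d,]+)</a>.*?</td>|is', $a, $m);

// $m[1] holds the captured values
print_r($m[1]);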

Result:

Array
(
    [0] => 30
    [1] => 16
    [2] => 56
    [3] => 14
    [4] => 96
    [5] => 4
    [6] => 0
    [7] => 17
  [.... and more....]
Gary Green
  • @Gary, if the problem is greediness... shouldn't all of his matches have failed, and not just the first line as he mentioned? – Mike Pennington May 01 '11 at 10:10
  • @Mike when I tested his regex, it only had one match -- the second row. It also started from the second row because of the class HTML quotation inconsistency. – Gary Green May 01 '11 at 10:14
  • @Mike: that doesn't seem to work; there should be only one selection point – Imran Omar Bukhsh May 01 '11 at 10:22
  • Ah interesting, I didn't know about the `/U` PCRE modifier, making all quantifiers ungreedy by default. Anyway, with a few adjustments this works now. See updated answer. – Gary Green May 01 '11 at 10:50
  • @Gary Green: hey, align=["']center["'] worked! Thanks. One more question though: how did you spot the inconsistency in the class HTML quotation? I could not spot it in Firefox or Chrome, or even using wget on Linux – Imran Omar Bukhsh May 01 '11 at 10:53
  • It's right there: http://i54.tinypic.com/33mr7e8.jpg ;-) It's actually an align inconsistency, boop! – Gary Green May 01 '11 at 10:57
  • @Gary Green : ah thanx! firebug and chrome's developer tools weren't showing that. Had to 'view source'!! – Imran Omar Bukhsh May 01 '11 at 11:03
  • @Imran when creating a regex, NEVER look at Firebug or any of those developer tools. They will often "normalize" the HTML and hide details that are vital for parsing: which quote marks are used, the real attribute order, newlines, tabs, etc. They ARE useful for creating CSS selectors though. – Gary Green May 01 '11 at 11:27
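
One way to check the raw markup without relying on a browser's developer tools is to list each attribute exactly as it is written in the fetched source. This is only a sketch: `$raw` is a hypothetical sample standing in for the string returned by `file_get_contents()`.

```php
<?php
// Hypothetical raw markup, as it might come back from file_get_contents().
$raw = <<<'HTML'
<td align='center' class="row2">30</td>
<td align="center" class="row2">16</td>
HTML;

// Collect every align attribute exactly as written, quotes included.
preg_match_all('~align=\S+~', $raw, $m);

// Count how often each spelling occurs; more than one key means
// the page mixes quoting styles.
$counts = array_count_values($m[0]);
print_r($counts);
```

If the counts show more than one spelling, the pattern needs to tolerate each of them.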