Greetings everyone

I have this regular expression, which goes as follows:

$thread_views_exp = '~<td class="alt1" align="center">.*</td> <td class="alt2" align="center">(.*)</td> </tr>~isU';

The purpose is to get all the 'views' values (first column from the left) for this sample thread URL: http://www.swalif.net/softs/swalif45. Everything works fine except for the first value.

Sample Output:

Array
(
    [0] => 12 528
    [1] => 2,732
    [2] => 506
    [3] => 73
    [4] => 83
    [5] => 245
    [6] => 100
    [7] => 201
    [8] => 55
    [9] => 55
    [10] => 37
    [11] => 349
    [12] => 123
    [13] => 75
    [14] => 173
    [15] => 260
    [16] => 101
    [17] => 660
    [18] => 158
    [19] => 66
    [20] => 177
    [21] => 165
    [22] => 228
    [23] => 812
    [24] => 347
    [25] => 197
    [26] => 348
    [27] => 263
    [28] => 176
    [29] => 315
    [30] => 173
    [31] => 273
    [32] => 199
)

Thanks for your assistance. Imran

Andy Lester
Imran Omar Bukhsh
  • Don't [parse html with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). So don't try... – ircmaxell Feb 10 '11 at 11:48
  • Please don't push your ideas onto others. Let everyone have their own view about it. I am using it in my context and have been very successful with 90% of my work. This is just one small thing which is stuck; if you can help, it's all right, otherwise no need to comment. – Imran Omar Bukhsh Feb 10 '11 at 11:52
  • Are the table cells in right-to-left arrangement too by any chance? Got me quite confused. The cause of the extraneous text content is not really obvious. Maybe you should post a `tidy -i` reformatted source example. – mario Feb 10 '11 at 11:55
  • @Russell - the above issue solved will make it 100% – Imran Omar Bukhsh Feb 10 '11 at 11:58
  • @Imran: It will not work in the general case. Why not just use an actual HTML parser? It'll be far easier to maintain, far more efficient, and far more robust. What happens if you feed `<![CDATA[]]>FooBar` in? You'll miss the foobar bit... It's better to use an HTML parser and be done than try to make something work that can't (in the general sense at least)... – ircmaxell Feb 10 '11 at 11:58
  • @mario: sorry, I did not get what you mean by 'tidy -i'? – Imran Omar Bukhsh Feb 10 '11 at 11:59
  • If you have some control over the generated HTML, or always get it from a given source, you could very well use regex, as you can anticipate any hiccups. For arbitrary HTML, go with a parser, as there can be too many special cases for regex to handle them all. – David Mårtensson Feb 10 '11 at 12:03
  • Can't you just extract it from the database? You take the numbers from *somewhere* in the first place, don't you? – mingos Feb 10 '11 at 12:11
  • @David - that is exactly my case – Imran Omar Bukhsh Feb 10 '11 at 12:29
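
For reference, the parser-based approach suggested in the comments could look roughly like the sketch below. It assumes PHP's bundled DOM extension and a made-up HTML fragment (the `$html` string and its values are hypothetical, not taken from the actual page), and it selects cells by `class="alt2"` only:

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<table><tr>'
      . '<td class="alt1" align="center">5</td>'
      . '<td class="alt2" align="center">2,732</td>'
      . '</tr><tr>'
      . '<td class="alt1" align="center">12</td>'
      . '<td class="alt2" align="center">506</td>'
      . '</tr></table>';

// Collect libxml warnings internally instead of printing them;
// real-world HTML often triggers many of them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <td> whose class attribute is exactly "alt2".
$xpath = new DOMXPath($doc);
$views = [];
foreach ($xpath->query('//td[@class="alt2"]') as $cell) {
    $views[] = trim($cell->textContent);
}

print_r($views); // the two values: 2,732 and 506
```

Whether this is worth it for a single, known page layout is a judgment call; the XPath expression would need adjusting if other cells on the real page share the same class.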

2 Answers

It seems to be a case of table cell greediness. My test also gave me an extraneous <td>. But there is a simple way to make the regex more stringent:

$rx = '~<td class="alt1" align="center">.*</td> <td class="alt2" align="center">([\d,]+)</td> </tr>~isU';

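Here [\d,]+ is used in place of (.*), so only digits and commas can match. The previous .* was eating up too much: even in ungreedy mode (the U flag), a match starts at the leftmost possible position, and . can still run across tag boundaries until the rest of the pattern fits.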

General tip: you might want to use [^<>]* for safely matching text content between HTML tags, instead of .*. Maybe also use \s+ instead of literal spaces.
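
A minimal sketch of these tips combined, assuming a made-up two-row fragment (the `$html` string and its values are hypothetical, not taken from the actual page). It uses `[^<>]*` for the first cell, `\s+` between tags, and `[\d,]+` for the captured cell:

```php
<?php
// Hypothetical sample markup; the real page's layout may differ.
$html = <<<HTML
<tr>
<td class="alt1" align="center">5</td>
<td class="alt2" align="center">2,732</td>
</tr>
<tr>
<td class="alt1" align="center">12</td>
<td class="alt2" align="center">506</td>
</tr>
HTML;

// [^<>]* cannot cross a tag boundary, \s+ tolerates any whitespace
// between tags, and [\d,]+ accepts only digits and commas.
$rx = '~<td class="alt1" align="center">[^<>]*</td>\s+'
    . '<td class="alt2" align="center">([\d,]+)</td>\s+</tr>~i';

preg_match_all($rx, $html, $matches);

// Strip the thousands separators to get plain integers.
$views = array_map(
    function ($v) {
        return (int) str_replace(',', '', $v);
    },
    $matches[1]
);

print_r($views); // the integers 2732 and 506
```

The `s` and `U` modifiers are left out here because the pattern no longer contains a dot, and none of the character classes can overrun a tag boundary.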

mario

Maybe try

~<td class="alt2" [^\<\>]+?>([\d,]+)</td>~isU

This assumes that the tds you are interested in are always of class="alt2"

And there's probably no need to escape the < and > signs, i.e.:

~<td class="alt2" [^<>]+?>([\d,]+)</td>~isU
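
A quick check of this pattern against a made-up fragment (the `$html` string is hypothetical). It shows that an alt2 cell holding non-numeric text is skipped, as is a numeric cell with a different class:

```php
<?php
// Hypothetical fragment: two alt2 cells, only one of which holds a number.
$html = '<td class="alt2" align="center">1,204</td>'
      . '<td class="alt2" align="left">some title</td>'
      . '<td class="alt1" align="center">77</td>';

$rx = '~<td class="alt2" [^<>]+?>([\d,]+)</td>~isU';

preg_match_all($rx, $html, $matches);

print_r($matches[1]); // only the value 1,204 is captured
```

Note that the U modifier inverts quantifier greediness, so `+?` is greedy here and `[\d,]+` is lazy; the match is the same either way because `[^<>]` cannot pass the closing `>` and `[\d,]+` must reach `</td>`.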
El Ronnoco