2

I'm decent at PHP (far from an expert) but a pure novice when it comes to regular expressions and scraping. I wanted to do a little scraping to help with some research and to educate myself, but I've run into a problem. I want to extract the prize from the following part of a page:

<th valign="top"> Prize pool:
</th>
<td> $75,000
</td></tr>

Needless to say, the prize pool value will change. I want to get the prize, and only the prize from this part (in this example the script should print out $75,000).

This is what I have so far:

preg_match('/Prize pool:\n<\/th>\n<td>(.*)/i', $file_string, $prize);

However, this prints out:

Prize pool:
</th> 
<td> $75,000
Anders
  • While @JohnConde's comment is quite true, a better answer here is that you should use something like http://php.net/domdocument. – David Kiger Feb 25 '13 at 12:57
  • put `//th[contains(text(), 'Prize pool')]/td` into https://gist.github.com/1358174 – Gordon Feb 25 '13 at 12:59
  • If the value is always going to be a dollar sign followed by numbers, could you not just search for the dollar and any numbers / commas after? – Matt Feb 25 '13 at 13:01
  • @gordon, maybe you meant: `//th[contains(text(), 'Prize pool')]/following-sibling::td` ? – pguardiario Feb 26 '13 at 00:04
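
For anyone who wants to try the DOM route suggested in the comments, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath with the `following-sibling::td` expression from the last comment. The surrounding `<table><tr>` wrapper and the variable names are assumptions added for illustration; on a real page you would load the downloaded HTML instead of `$html`.

```php
<?php
// Sample markup from the question, wrapped in <table><tr> (an assumption) so it is a complete fragment.
$html = <<<'HTML'
<table><tr>
<th valign="top"> Prize pool:
</th>
<td> $75,000
</td></tr></table>
HTML;

$doc = new DOMDocument();
// Scraped pages are rarely valid HTML; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// The <td> is a sibling of the <th>, not a child, hence following-sibling.
$cells = $xpath->query("//th[contains(text(), 'Prize pool')]/following-sibling::td");

// Take the first matching cell and strip the surrounding whitespace.
$prizeText = $cells->length > 0 ? trim($cells->item(0)->textContent) : null;
echo $prizeText, "\n";
```

This avoids depending on the exact line breaks between the tags, which is what makes the regex versions fragile.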

3 Answers

1
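// Lazy .+? stops at the first dollar amount after "Prize pool:" (a greedy .+ would run on to the last amount on the page).
preg_match('/Prize pool:.+?(\$\d+(?:\.|,)\d+)/is', $file_string, $prize);
echo '<pre>' . print_r($prize, 1) . '</pre>';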

Like this.

A little explanation

. matches any single character except the newline character "\n"

+ means one or more repetitions of the preceding element

So .+ means that "Prize pool:" must be followed by at least one character of any kind

(...) is called a capturing group. Each group is stored in its own element of the $prize array: $prize[0] holds the whole match and $prize[1] holds the first group

$ in a pattern means the end of the string (or of a line, with the m modifier), so to match a literal dollar sign we have to escape it like this: \$

\d matches one digit from 0 to 9, and \d+ matches one or more digits

(?:...) is a group too, but it is not saved in $prize, because we used ?: after the opening (

Since . matches any single character, a literal dot has to be escaped as \. So \.|, means we are looking for either . or ,

/here pattern/i: the i modifier makes the regex case-insensitive

/here pattern/s: the s modifier makes the metacharacter . match the newline character as well
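
To make the group numbering concrete, here is a small sketch that runs this kind of pattern against the snippet from the question. The sample string stands in for the real `$file_string`, and the sketch uses a lazy `.+?` rather than `.+` so that on a full page the match stops at the first amount after the label; treat both as assumptions for illustration.

```php
<?php
// Sample input copied from the question; a real script would fill $file_string from the downloaded page.
$file_string = <<<'HTML'
<th valign="top"> Prize pool:
</th>
<td> $75,000
</td></tr>
HTML;

// i: case-insensitive, s: let . match newlines between "Prize pool:" and the amount.
if (preg_match('/Prize pool:.+?(\$\d+(?:\.|,)\d+)/is', $file_string, $prize)) {
    // $prize[0] is the whole match (label, tags and amount); $prize[1] is only the captured amount.
    echo $prize[1], "\n";
} else {
    echo "No prize found\n";
}
```

Printing `$prize[0]` instead of `$prize[1]` gives the whole matched text starting at "Prize pool:", which is the output shown in the question.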

Winston
0

Prize pool:\s*<\/th>\s*<td>\s+(.*)\s+<\/td>

If you only need this one value from the HTML, just use a regex; there is no need for a full HTML parser to capture a number from an HTML string.

Use Rubular to test your regex.
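
The pattern above is shown without delimiters, so here is one way it might be wired into PHP. The helper name `extract_prize` and the choice of `~` as delimiter are assumptions for illustration, not part of the answer.

```php
<?php
// Hypothetical helper (not from the answer): returns the prize text, or null when nothing matches.
function extract_prize(string $html): ?string
{
    // With "~" as the delimiter, the slashes in </th> and </td> need no escaping.
    $pattern = '~Prize pool:\s*</th>\s*<td>\s+(.*)\s+</td>~';
    // trim() removes any stray whitespace the greedy (.*) may have picked up.
    return preg_match($pattern, $html, $m) ? trim($m[1]) : null;
}

// Sample markup copied from the question.
$sample = <<<'HTML'
<th valign="top"> Prize pool:
</th>
<td> $75,000
</td></tr>
HTML;

echo extract_prize($sample), "\n";
```

Because the `\s*` and `\s+` parts absorb the line breaks, this version needs no `s` modifier, but it does require the `<th>` and `<td>` tags to appear exactly as written.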

Yousf
0
// The s modifier lets . cross the line breaks between "Prize pool:" and the <td> tag.
$reg = '~Prize pool:.*?td>\s*(.*?)\s*<~s';

rubular demo

Wh1T3h4Ck5