
We've been using information from a site for a while now (the site allows this as long as you credit the source, which we do), and we've been copying it by hand. As you can imagine, this gets tedious pretty fast, so I've been trying to automate the process by fetching the information with a PHP script.

The URL I'm trying to fetch is:

http://mediaforest.ro/weeklycharts/viewchart.aspx?r=WeeklyChartRadioLocal&y=2010&w=46 08-11-10 14-11-10

If I enter it in a browser it works, but if I try file_get_contents() I get Bad Request.

I figured they were checking whether the client is a browser, so I rolled a cURL-based solution:

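// The URL from above, passed exactly as written (spaces included)
$url = 'http://mediaforest.ro/weeklycharts/viewchart.aspx?r=WeeklyChartRadioLocal&y=2010&w=46 08-11-10 14-11-10';

$ch = curl_init();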

$header=array(
  'User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12',
  'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language: en-us,en;q=0.5',
  'Accept-Encoding: gzip,deflate',
  'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
  'Keep-Alive: 115',
  'Connection: keep-alive',
);

curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch,CURLOPT_COOKIEFILE,'cookies.txt');
curl_setopt($ch,CURLOPT_COOKIEJAR,'cookies.txt');
curl_setopt($ch,CURLOPT_HTTPHEADER,$header);
$result=curl_exec($ch);

curl_close($ch);

I've checked, and the headers are identical to my browser's headers, yet I still get Bad Request.

So I tried another solution:

http://www.php.net/manual/en/function.curl-setopt.php#78046

Unfortunately this doesn't work either and I'm out of ideas. What am I missing?

netcoder
pandronic
  • Did you use `urlencode` on the URL before calling `file_get_contents`? – Evan Mulawski Nov 15 '10 at 13:38
  • Damn, that's embarrassing ... how could I miss that? – pandronic Nov 15 '10 at 13:48
  • Well, it works even with file_get_contents(), so there's no protection whatsoever. Sorry for wasting everybody's time :) – pandronic Nov 15 '10 at 13:56
  • @pandronic: I would suggest still masking it. They could be watching the logs and mightn't like people scraping the data, so take precautions to prevent them from blocking automated tools. (just my $0.02) – Reese Moore Nov 15 '10 at 14:00
  • Just to add, sometimes for https URLs you need to turn off peer verification: curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); – fedmich Dec 03 '13 at 04:06

3 Answers


Try escaping your URL (the spaces in the w parameter need to become %20); it works for me that way:

http://mediaforest.ro/weeklycharts/viewchart.aspx?r=WeeklyChartRadioLocal&y=2010&w=46%2008-11-10%2014-11-10
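A minimal sketch of one way to produce that escaped URL in PHP rather than typing the %20s by hand. It assumes PHP 5.4 or later (for the PHP_QUERY_RFC3986 flag); the variable names $base and $params are just illustrative. Only the query values are encoded here, since running urlencode() over the entire URL would also escape the "://" and "?" parts.

```php
<?php
// Base address and parameters taken from the URL in the question.
$base = 'http://mediaforest.ro/weeklycharts/viewchart.aspx';
$params = array(
    'r' => 'WeeklyChartRadioLocal',
    'y' => 2010,
    'w' => '46 08-11-10 14-11-10', // the value that contains raw spaces
);

// PHP_QUERY_RFC3986 encodes spaces as %20 (the default mode would emit "+").
$url = $base . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);

echo $url, "\n";
```

The resulting $url can then be passed to file_get_contents() or to CURLOPT_URL as usual.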
Reese Moore

Use curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12');

You can replace the user agent with another one, of course.

However, "Bad Request" is most likely NOT related to a missing or bad user agent. It sounds like the web server itself doesn't like your request, not the application behind the requested URI.

ThiefMaster

I had to remove the 'Accept-Encoding: gzip,deflate' line from $header to get it to work properly on my GoDaddy website.
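One likely explanation, offered as a guess rather than a diagnosis of that particular host: when Accept-Encoding is sent by hand through CURLOPT_HTTPHEADER, cURL does not decompress the response, so the body can come back as raw gzip bytes. The sketch below shows the alternative of letting cURL handle compression itself via CURLOPT_ENCODING. The helper names compressed_fetch_options() and fetch_decoded() are hypothetical, and the code assumes the cURL extension is loaded.

```php
<?php
// Hypothetical helper: options that let cURL negotiate and decode compression.
function compressed_fetch_options($url)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        // An empty string tells cURL to advertise every encoding it supports
        // and to decompress the response body automatically.
        CURLOPT_ENCODING       => '',
    );
}

// Hypothetical helper: fetch a URL and return the decoded body, or false on failure.
function fetch_decoded($url)
{
    $ch = curl_init();
    curl_setopt_array($ch, compressed_fetch_options($url));
    return curl_exec($ch);
}
```

With this in place, the manual 'Accept-Encoding' entry can stay out of $header, and fetch_decoded($url) returns plain text whether or not the server compresses its response.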