
I have a strange situation.

I have a link that works in all the browsers I currently have (Chrome, IE, Firefox). I tried to crawl the page using Scrapy in Python; however, I get `response.status == 400`. I am using Tor + Polipo to crawl anonymously.

response.body is :

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head>
<title>Proxy error: 400 Couldn't parse URL.</title>
</head><body>
<h1>400 Couldn't parse URL</h1>
<p>The following error occurred while trying to access <strong>https://example.com/blah</strong>:<br><br>
<strong>400 Couldn't parse URL</strong></p>
<hr>Generated Thu, 11 Dec 2014 13:55:38 UTC by Polipo on <em>localhost:8123</em>.
</body></html>

I'm just wondering why that should be. How can a browser get results when Scrapy can't?
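Since Polipo's error is specifically "Couldn't parse URL", one possible culprit (an assumption, not confirmed by the error page) is a character in the URL that browsers silently percent-encode but Scrapy forwards as-is, such as a space or a non-ASCII character. A minimal stdlib-only sketch of re-encoding the URL before handing it to the crawler (the `sanitize_url` helper name is hypothetical):

```python
from urllib.parse import urlsplit, quote

def sanitize_url(url):
    """Percent-encode stray spaces and non-ASCII characters in the
    path and query, leaving already-encoded sequences intact."""
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return parts._replace(path=path, query=query).geturl()

print(sanitize_url("https://example.com/blah blah?q=caf\u00e9"))
# https://example.com/blah%20blah?q=caf%C3%A9
```

If the sanitized URL goes through the proxy while the raw one does not, that would explain why only some URLs from the same domain fail.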

nafas
  • Could you show the code you are using here? – selllikesybok Dec 11 '14 at 14:04
  • Perhaps the server is blocking scrapy? Try changing the user agent. – Ramchandra Apte Dec 11 '14 at 14:06
  • @selllikesybok well, the scrapy project is quite big to fit here. Is there any specific part I should show? – nafas Dec 11 '14 at 14:08
  • @RamchandraApte so far I tried over 10 different agents, I keep trying to see if I get anything – nafas Dec 11 '14 at 14:09
  • @nafas The website may be blocking because of Tor. – Ramchandra Apte Dec 11 '14 at 14:10
  • @nafas for starters, the line where response is assigned a value? – selllikesybok Dec 11 '14 at 14:11
  • @RamchandraApte well it works for some of the pages, but not all – nafas Dec 11 '14 at 14:13
  • @selllikesybok the response is assigned through scrapy. I don't have access to that really – nafas Dec 11 '14 at 14:17
  • @nafas that might be a missing `User-Agent` header. Are you setting it? – alecxe Dec 11 '14 at 14:25
  • @alecxe well I've tried it with many different agents. The thing is, it works for most of the pages (from the same domain), but some, like the example above, don't work. – nafas Dec 11 '14 at 14:28
  • @nafas I mean that they can be checking your User-Agent header for matching a real browser. – alecxe Dec 11 '14 at 14:33
  • @alecxe [http://techpatterns.com/downloads/firefox/useragent_switcher_agents.txt](http://techpatterns.com/downloads/firefox/useragent_switcher_agents.txt) I'm using this source to get my USER_AGENTS, the thing I don't understand is why it works for some of the urls but not others – nafas Dec 11 '14 at 14:35
  • It could be that you are making too many requests in a short amount of time, and the server doesn't like it. Try setting the `DOWNLOAD_DELAY` to something like `0.25`, or use the [AutoThrottle extension](http://doc.scrapy.org/en/latest/topics/autothrottle.html) to see what happens. – bosnjak Dec 12 '14 at 10:17
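The throttling suggestion in the last comment would look roughly like this in a Scrapy `settings.py` (a sketch; the delay values are illustrative, not tuned for this site):

```python
# settings.py — throttle requests so the server/proxy is not hammered
DOWNLOAD_DELAY = 0.25          # fixed quarter-second gap between requests

# Or let Scrapy adapt the delay to observed latency instead:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
```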

0 Answers