
I have a strange situation.

I have a link that works in all the browsers I currently have (Chrome, IE, Firefox). I tried to crawl the page using Scrapy in Python; however, I get `response.status == 400`. I am using Tor + Polipo to crawl anonymously.

response.body is :

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head>
<title>Proxy error: 400 Couldn't parse URL.</title>
</head><body>
<h1>400 Couldn't parse URL</h1>
<p>The following error occurred while trying to access <strong>https://example.com/blah</strong>:<br><br>
<strong>400 Couldn't parse URL</strong></p>
<hr>Generated Thu, 11 Dec 2014 13:55:38 UTC by Polipo on <em>localhost:8123</em>.
</body></html>

I'm just wondering why that should be. How can a browser get results when Scrapy can't?
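Since Polipo's error is specifically "Couldn't parse URL", one possible culprit (an assumption, not confirmed by the error page) is a character in the URL that browsers silently percent-encode but Scrapy forwards as-is, such as a space or a non-ASCII character. A minimal stdlib-only sketch of re-encoding the URL before handing it to the crawler (the `sanitize_url` helper name is hypothetical):

```python
from urllib.parse import urlsplit, quote

def sanitize_url(url):
    """Percent-encode stray spaces and non-ASCII characters in the
    path and query, leaving already-encoded sequences intact."""
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return parts._replace(path=path, query=query).geturl()

print(sanitize_url("https://example.com/blah blah?q=caf\u00e9"))
# https://example.com/blah%20blah?q=caf%C3%A9
```

If the sanitized URL goes through the proxy while the raw one does not, that would explain why only some URLs from the same domain fail.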

nafas
  • Could you show the code you are using here? – selllikesybok Dec 11 '14 at 14:04
  • Perhaps the server is blocking scrapy? Try changing the user agent. – Ramchandra Apte Dec 11 '14 at 14:06
  • @selllikesybok well, the scrapy project is quite big to fit here. Is there any specific part I should show? – nafas Dec 11 '14 at 14:08
  • @RamchandraApte so far I tried over 10 different agents, I keep trying to see if I get anything – nafas Dec 11 '14 at 14:09
  • @nafas The website may be blocking because of Tor. – Ramchandra Apte Dec 11 '14 at 14:10
  • @nafas for starters, the line where response is assigned a value? – selllikesybok Dec 11 '14 at 14:11
  • @RamchandraApte well it works for some of the pages, but not all – nafas Dec 11 '14 at 14:13
  • @selllikesybok the response is assigned through scrapy. I don't have access to that really – nafas Dec 11 '14 at 14:17
  • @nafas that might be a missing `User-Agent` header. Are you setting it? – alecxe Dec 11 '14 at 14:25
  • @alecxe well I've tried it with many different agents. The thing is, it works for most of the pages (from the same domain), but some, like the example above, don't work. – nafas Dec 11 '14 at 14:28
  • @nafas I mean that they can be checking your User-Agent header for matching a real browser. – alecxe Dec 11 '14 at 14:33
  • @alecxe [http://techpatterns.com/downloads/firefox/useragent_switcher_agents.txt](http://techpatterns.com/downloads/firefox/useragent_switcher_agents.txt) I'm using this source to get my USER_AGENTS, the thing I don't understand is why it works for some of the urls but not others – nafas Dec 11 '14 at 14:35
  • It could be that you are making too many requests in a short amount of time, and the server doesn't like it. Try setting the `DOWNLOAD_DELAY` to something like `0.25`, or use the [AutoThrottle extension](http://doc.scrapy.org/en/latest/topics/autothrottle.html) to see what happens. – bosnjak Dec 12 '14 at 10:17
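The throttling suggestion in the last comment would look roughly like this in a Scrapy `settings.py` (a sketch; the delay values are illustrative, not tuned for this site):

```python
# settings.py — throttle requests so the server/proxy is not hammered
DOWNLOAD_DELAY = 0.25          # fixed quarter-second gap between requests

# Or let Scrapy adapt the delay to observed latency instead:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
```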

0 Answers