
I am totally new to Scrapy. I am working on a project in which I need to use Scrapy to crawl this website: https://www.google.com/partners/#a_search;bdgt=10000;lang=en;locn=United%20States;motv=0;wbst=http%253A%252F%252F
I can't pass the whole URL to a request in Scrapy. Debugging in PyCharm, I found that only the part of the URL before the # gets through. Can anybody help me solve this problem? Thanks a lot!

jess1818
  • Hope [this](http://stackoverflow.com/questions/33395133/scrapy-google-crawl-doesnt-work/33395421#33395421) helps – eLRuLL Nov 28 '16 at 20:09
  • I tried [link](https://www.google.com/partners/?a_search....) and [link](https://www.google.com/partners/?search...); neither of them works :( – jess1818 Nov 28 '16 at 21:29
  • Or try PhantomJS + Selenium inside Scrapy .... [look at my answer](http://stackoverflow.com/a/40833619/4094231) – Umair Ayub Dec 01 '16 at 14:48

1 Answer


The URL fragment (the part after #) is not sent to remote web servers; this is how HTTP works. The fragment is handled by the browser after the request is sent; in Google's case it triggers some JavaScript functions, etc.

Scrapy is not a browser - it doesn't evaluate JavaScript; Scrapy just downloads data via HTTP. That's the reason the fragment is stripped from the URL when Scrapy fetches a page - there is no way to use it.
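You can see this split for yourself with Python's standard `urllib.parse` (this is not Scrapy-specific, just a quick illustration that the fragment is a separate, client-side component of the URL):

```python
from urllib.parse import urlsplit

url = ("https://www.google.com/partners/#a_search;bdgt=10000;"
       "lang=en;locn=United%20States;motv=0;wbst=http%253A%252F%252F")

parts = urlsplit(url)

# Only scheme, host, path and query are used to build the HTTP request;
# the fragment never leaves the client.
print(parts.path)      # /partners/
print(parts.fragment)  # a_search;bdgt=10000;lang=en;...
```

Everything Scrapy (or any HTTP client) sends to the server is built from `parts.path` and `parts.query`; `parts.fragment` is simply dropped.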

If you want to handle such URL fragments you have two options:

  1. emulate what the browser is doing - inspect which HTTP requests it makes when you open this URL (e.g. in your browser's developer tools) and reproduce them in Scrapy;
  2. use a browser engine to render the page, e.g. Selenium, PhantomJS or Splash. There is a plugin for Scrapy + Splash integration: https://github.com/scrapy-plugins/scrapy-splash.
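For option 1, a first step is recovering the parameters encoded in the fragment, since those are what the page's JavaScript turns into real requests. The sketch below parses this particular fragment format (a view name followed by `;`-separated `key=value` pairs); which endpoint to send the parameters to is something you'd have to discover in the browser's network tab - it is not shown here:

```python
from urllib.parse import unquote

fragment = ("a_search;bdgt=10000;lang=en;locn=United%20States;"
            "motv=0;wbst=http%253A%252F%252F")

# First ';'-separated token names the view; the rest are key=value pairs.
view, *pairs = fragment.split(";")
params = {k: unquote(v)
          for k, v in (pair.split("=", 1) for pair in pairs)}

print(view)    # a_search
print(params)  # {'bdgt': '10000', 'lang': 'en', 'locn': 'United States', ...}
```

These decoded values can then be passed as form data or query parameters in a `scrapy.Request`/`scrapy.FormRequest` once you know the real endpoint the page calls.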
Mikhail Korobov