I am totally new to Scrapy. I am working on a project in which I need to crawl this website with Scrapy: https://www.google.com/partners/#a_search;bdgt=10000;lang=en;locn=United%20States;motv=0;wbst=http%253A%252F%252F
I can't pass the whole URL to the response in Scrapy. I debugged it in PyCharm and found that only the part of the URL before the # gets passed. Can anybody help me solve this problem? Thanks a lot!

jess1818
- hope [this](http://stackoverflow.com/questions/33395133/scrapy-google-crawl-doesnt-work/33395421#33395421) helps – eLRuLL Nov 28 '16 at 20:09
- I tried [link](https://www.google.com/partners/?a_search....) and [link](https://www.google.com/partners/?search...); neither of them works :( – jess1818 Nov 28 '16 at 21:29
- Or try PhantomJS + Selenium inside Scrapy... [look at my answer](http://stackoverflow.com/a/40833619/4094231) – Umair Ayub Dec 01 '16 at 14:48
1 Answer
The URL fragment (the part after #) is not sent to remote web servers; this is how HTTP works. The fragment is handled by the browser after the request is sent; in the case of Google it triggers some JavaScript functions, etc.
Scrapy is not a browser - it doesn't evaluate JavaScript; Scrapy just downloads data via HTTP. That's why the fragment is stripped from the URL when Scrapy fetches a page - there is no way to send it to the server.
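You can see this for yourself by splitting the URL with the standard library; a minimal sketch (using the URL from the question) showing which part never leaves your machine:

```python
from urllib.parse import urlsplit

url = ("https://www.google.com/partners/#a_search;bdgt=10000;lang=en;"
       "locn=United%20States;motv=0;wbst=http%253A%252F%252F")

parts = urlsplit(url)
# Everything after '#' ends up in the fragment, which is never sent over HTTP
print(parts.fragment)
# This is all the server (and Scrapy) actually sees
print(parts._replace(fragment="").geturl())  # -> https://www.google.com/partners/
```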
If you want to handle such URL fragments you have two options:
- emulate what the browser is doing - inspect which HTTP requests it makes when you open this URL and replicate them in Scrapy;
- use a browser engine to render the page, e.g. Selenium, PhantomJS or Splash. There is a plugin for scrapy+splash integration: https://github.com/scrapy-plugins/scrapy-splash (see the sketch after this list).
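For the second option, a minimal sketch of a spider using scrapy-splash, following the settings from that plugin's README. It assumes a Splash instance is running on localhost:8050; whether the page's JavaScript actually reacts to the fragment is up to the site, and the parsing logic is left as a placeholder:

```python
import scrapy
from scrapy_splash import SplashRequest


class PartnersSpider(scrapy.Spider):
    name = "partners"

    custom_settings = {
        # Wiring from the scrapy-splash README; assumes Splash on localhost:8050
        "SPLASH_URL": "http://localhost:8050",
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_splash.SplashCookiesMiddleware": 723,
            "scrapy_splash.SplashMiddleware": 725,
            "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
        },
        "SPIDER_MIDDLEWARES": {
            "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
        },
        "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
    }

    def start_requests(self):
        url = ("https://www.google.com/partners/#a_search;bdgt=10000;lang=en;"
               "locn=United%20States;motv=0;wbst=http%253A%252F%252F")
        # Let Splash load the page and run its JavaScript before returning HTML
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # response.text is the rendered HTML; real selectors depend on the page
        self.logger.info("Rendered page length: %d", len(response.text))
```

You would start a local Splash instance first (the README suggests `docker run -p 8050:8050 scrapinghub/splash`) and then run the spider as usual with `scrapy crawl partners`.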

Mikhail Korobov