I am trying to crawl a website whose content is generated by JavaScript.
I have installed Scrapy and Splash.
Splash is started with this command:
sudo docker run -p 8050:8050 -v /etc/splash/proxy-profiles:/etc/splash/proxy-profiles scrapinghub/splash
2015-08-21 07:21:06+0000 [-] Log opened.
2015-08-21 07:21:06.483344 [-] Splash version: 1.7
2015-08-21 07:21:06.490230 [-] Qt 4.8.1, PyQt 4.9.1, WebKit 534.34, sip 4.13.2, Twisted 15.2.1, Lua 5.2
2015-08-21 07:21:06.490505 [-] Open files limit: 524288
2015-08-21 07:21:06.490745 [-] Open files limit increased from 524288 to 1048576
2015-08-21 07:21:06.699607 [-] Xvfb is started: ['Xvfb', ':1087', '-screen', '0', '1024x768x24']
2015-08-21 07:21:06.808450 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2015-08-21 07:21:06.929580 [-] verbosity=1
2015-08-21 07:21:06.929964 [-] slots=50
2015-08-21 07:21:06.930484 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Proxy Server: enabled
2015-08-21 07:21:06.931420 [-] Site starting on 8050
2015-08-21 07:21:06.931640 [-] Starting factory <twisted.web.server.Site instance at 0x1b5b3f8>
2015-08-21 07:21:06.938232 [-] SplashProxyServerFactory starting on 8051
2015-08-21 07:21:06.938468 [-] Starting factory <splash.proxy_server.SplashProxyServerFactory instance at 0x1b5bcf8>
When I request the page through render.html, the returned HTML shows "Javascript is not enabled. Please enable JavaScript in your browser". This is the spider:
import scrapy


class xxxxxSpider(scrapy.Spider):
    start_urls = ["xxxxx"]
    name = "sahibinden"

    def start_requests(self):
        for url in self.start_urls:
            # Route each request through Splash's render.html endpoint
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5, 'proxy': 'xxxxx'}
                }
            })

    def parse(self, response):
        # Append the rendered <body> to a local file for inspection
        with open("result.txt", "a") as myfile:
            myfile.write(str(response.css('body').extract()))
The settings appear to be correct:
DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}
SPLASH_URL = 'http://localhost:8050/'
DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'
I scraped the website successfully once. Since then, I keep getting the "Javascript is not enabled in your browser" error.
In case it helps, this is the Splash output when I render the page:
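For reference, a minimal sketch of how the failing response could be told apart from a good one inside `parse`, assuming the notice text quoted above is what the site returns in the failing case. `BLOCK_MARKER` and `looks_blocked` are hypothetical names and are not part of Scrapy or Splash.

```python
# Hypothetical marker: the common part of the two notices quoted in this post.
BLOCK_MARKER = "javascript is not enabled"


def looks_blocked(html):
    """Return True if the page text contains the notice (case-insensitive)."""
    return BLOCK_MARKER in html.lower()
```

Inside the spider this could be called as `looks_blocked(response.text)` before writing to result.txt, so that failed renders are logged instead of saved.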
2015-08-21 08:06:09.838076 [-] "172.17.42.1" - - [21/Aug/2015:08:06:09 +0000] "POST /render.html HTTP/1.1" 200 4048 "-" "Scrapy/1.0.3.post1+g83a06ed (+http://scrapy.org)"
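To rule out the Scrapy side, the same render could be requested from Splash directly. Below is a small sketch that only builds the render.html URL with the `url`, `wait` and `proxy` arguments used in the spider, so it can be opened in a browser or fetched with any HTTP client. `render_html_url` is a hypothetical helper name, and the Splash address is assumed to be the one from the settings.

```python
from urllib.parse import urlencode

# Assumed to match SPLASH_URL in the settings shown in this post.
SPLASH_URL = "http://localhost:8050/"


def render_html_url(target_url, wait=0.5, proxy=None):
    """Build a GET URL for Splash's render.html endpoint."""
    params = {"url": target_url, "wait": wait}
    if proxy is not None:
        # Same value as the 'proxy' argument passed from the spider
        params["proxy"] = proxy
    return SPLASH_URL.rstrip("/") + "/render.html?" + urlencode(params)
```

If the URL built this way also returns the "Javascript is not enabled" page, the problem is more likely on the site's side than in the Scrapy configuration.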
I can't figure out what the problem is. Any help?
Further Information
I deleted the virtual machine and tried again after the IP address had changed. The first request got the results successfully, but the second request could not get anything. I think the website is blocking my IP address.