I need to parse the news from several sites with javascript and use selenium + PhantomJS for it. But there are videos on these sites, which are useless for me and I don't need them at all. (I was given an advice to use selenium + Chrome or selenium + Firefox, but I don't need any opening windows during parsing).
These videos start playing automatically according to the site's logic, and in the end of the end exception http.client.RemoteDisconnected: Remote end closed connection without response
throws.
I think it throws because my internet is very slow and videos can't be full loaded with it.
How can I avoid this problem?
May be any content constraints exist in the selenium or PhantomJS?
Full traceback:
File "viralnova/viralnova.py", line 101, in parse_viralnova
_parse_post_link(postlinktest, driver)
File "viralnova/viralnova.py", line 9, in _parse_post_link
driver.get(post_link)
File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 309, in get
self.execute(Command.GET, {'url': url})
File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 295, in execute
response = self.command_executor.execute(driver_command, params)
File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 464, in execute
return self._request(command_info[0], url, body=data)
File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 526, in _request
resp = opener.open(request, timeout=self._timeout)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 526, in open
response = self._open(req, data)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 544, in _open
'_open', req)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 504, in _call_chain
result = func(*args)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 1346, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 1321, in do_open
r = h.getresponse()
File "/Users/user/anaconda/envs/env/lib/python3.6/http/client.py", line 1331, in getresponse
response.begin()
File "/Users/user/anaconda/envs/env/lib/python3.6/http/client.py", line 297, in begin
version, status, reason = self._read_status()
File "/Users/user/anaconda/envs/env/lib/python3.6/http/client.py", line 266, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
Code is here
def _parse_post_link(post_link, driver):
try:
driver.get(post_link)
except Exception:
return None
post_page_soup = Soup(driver.page_source, "lxml")
title = post_page_soup.find('div', attrs={'class': 'post-box-detail article'}).h2.text
print(title)
def parse_viralnova(to_csv=True):
driver = webdriver.PhantomJS("/Users/user/.phantomjsdriver/phantomjs")
postlinktest = 'http://www.viralnova.com/restroom-design-fails/'
_parse_post_link(postlinktest, driver)