I need to scrape news from several JavaScript-heavy sites, and I'm using Selenium + PhantomJS for it. These sites also contain videos, which are useless to me. (I was advised to use Selenium + Chrome or Selenium + Firefox, but I don't want any browser windows opening during parsing.)

These videos start playing automatically according to the site's logic, and eventually the exception http.client.RemoteDisconnected: Remote end closed connection without response is raised.

I think it is raised because my internet connection is very slow and the videos can't be fully loaded.

How can I avoid this problem?

Do Selenium or PhantomJS have any settings to restrict which content gets loaded?

Full traceback:

  File "viralnova/viralnova.py", line 101, in parse_viralnova
    _parse_post_link(postlinktest, driver)
  File "viralnova/viralnova.py", line 9, in _parse_post_link
    driver.get(post_link)
  File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 309, in get
    self.execute(Command.GET, {'url': url})
  File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 295, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 464, in execute
    return self._request(command_info[0], url, body=data)
  File "/Users/user/anaconda/envs/env/lib/python3.6/site-packages/selenium/webdriver/remote/remote_connection.py", line 526, in _request
    resp = opener.open(request, timeout=self._timeout)
  File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 526, in open
    response = self._open(req, data)
  File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 544, in _open
    '_open', req)
  File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 1346, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/Users/user/anaconda/envs/env/lib/python3.6/urllib/request.py", line 1321, in do_open
    r = h.getresponse()
  File "/Users/user/anaconda/envs/env/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/Users/user/anaconda/envs/env/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/Users/user/anaconda/envs/env/lib/python3.6/http/client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

Here is the code:

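# Imports assumed from usage: Soup is taken to be an alias for BeautifulSoup.
from bs4 import BeautifulSoup as Soup
from selenium import webdriver


def _parse_post_link(post_link, driver):
    # Load the page; give up on this link if the driver raises.
    try:
        driver.get(post_link)
    except Exception:
        return None

    # Parse the rendered page source and print the post title.
    post_page_soup = Soup(driver.page_source, "lxml")
    title = post_page_soup.find('div', attrs={'class': 'post-box-detail article'}).h2.text
    print(title)


def parse_viralnova(to_csv=True):
    driver = webdriver.PhantomJS("/Users/user/.phantomjsdriver/phantomjs")
    postlinktest = 'http://www.viralnova.com/restroom-design-fails/'
    _parse_post_link(postlinktest, driver)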
h3llca7

1 Answer

If you only need the text content, consider using plain Python with BeautifulSoup. Since no browser is involved, nothing on the page (such as autoplaying video) gets triggered, no windows open, and the scrape runs faster without the browser overhead.
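
A minimal sketch of that approach, assuming the post title is already present in the HTML the server returns (if it is injected by JavaScript, this will not find it). The helper names fetch_html and extract_title, the User-Agent header, and the timeout are illustrative choices, not anything from the question; the div class is the one used in the question's code.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=30):
    """Download the raw HTML. No browser runs, so no video is requested."""
    # The User-Agent value is an arbitrary assumption; some sites reject
    # requests that do not send one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html):
    """Return the post title, or None if the expected markup is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # Same class string as in the question's code.
    box = soup.find("div", attrs={"class": "post-box-detail article"})
    if box is None or box.h2 is None:
        return None
    return box.h2.get_text(strip=True)
```

Usage would look like extract_title(fetch_html(post_link)). Returning None instead of raising keeps one malformed page from aborting a whole batch of links.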

If you do need some JavaScript to run, you can try dryscrape as well.

Zegarek
  • There is JavaScript on the sites I need to parse, and the requests module, which I had also tried, doesn't execute JS – h3llca7 Nov 15 '17 at 12:35
  • That wasn't mentioned in your original question, but the answer would be to use dryscrape: https://stackoverflow.com/a/26440563/5298879. To make it work on Windows you'll also need to install Cygwin – Zegarek Nov 15 '17 at 13:40