I use Selenium and Python for some web scraping. The program works fine on my local machine. I now want to run the scraping job on my university's HPC cluster. I use Selenium with Firefox, and I have installed all the required modules and the Firefox driver.

from selenium import webdriver
browser = webdriver.Firefox()   # open firefox
browser.get(url)

After the `browser = webdriver.Firefox()` line, the HPC opened a Firefox window through XQuartz on my machine (I am using a Mac), followed by a one-line error (see the full traceback at the bottom):

selenium.common.exceptions.WebDriverException: Message: connection refused

This is not what I want. I want the HPC to run the Python program without opening XQuartz locally. I am very new to both Selenium and Python, so I have probably misused Selenium. I also tried the requests module. It works fine, but I need to click on elements of the page.

Full traceback. cz93 is my username on the server, and ccPython is the directory containing my python virtual environment.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/c/cz93/ccPython/lib/python3.6/site-packages/selenium/webdriver/firefox/webdriver.py", line 162, in __init__
    keep_alive=True)
  File "/home/c/cz93/ccPython/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 154, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/home/c/cz93/ccPython/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 243, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/c/cz93/ccPython/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "/home/c/cz93/ccPython/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: connection refused
semibruin
  • Please include the full traceback. `connection refused` is usually caused by a firewall issue; note that selenium uses local ports to interface with the webdriver. Also, how does XQuartz fit in here? Why is it necessary? If the HPC environment is headless, you should use the headless options for firefox. – sytech Mar 26 '18 at 16:57
  • @sytech I have edited the question to include the full traceback. XQuartz opened itself after running `browser = webdriver.Firefox()`. Your comment about `headless` is interesting. What is `headless`? I was googling around. Would `phantomJS` help here? – semibruin Mar 26 '18 at 17:28
  • PhantomJS is no longer supported since Chrome and Firefox support headless modes. See [this answer](https://stackoverflow.com/a/46768243/5747944) for using firefox in headless mode with selenium. – sytech Mar 26 '18 at 17:30
  • Very cool. I love `headless`. I was running with a visible browser window locally, and the window is annoying. Thanks a lot. – semibruin Mar 26 '18 at 17:34
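
For reference, the headless approach suggested in the comments might look roughly like the sketch below. It assumes Selenium 3.8 or later (where `webdriver.Firefox` accepts an `options` keyword) and a geckodriver binary on the `PATH`. The helper names `open_headless_firefox` and `fetch_title` and the `https://example.com` URL are placeholders for illustration, not part of the original program.

```python
# Firefox command-line switch that starts the browser without a display.
HEADLESS_FLAG = "-headless"


def open_headless_firefox():
    """Start Firefox with no window (hypothetical helper)."""
    # Imported lazily so this file can still be loaded where selenium is absent.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument(HEADLESS_FLAG)
    # Assumes geckodriver can be found on PATH.
    return webdriver.Firefox(options=options)


def fetch_title(url):
    """Load a page headlessly and return its title (hypothetical helper)."""
    browser = open_headless_firefox()
    try:
        browser.get(url)
        return browser.title
    finally:
        # Always shut the browser down, even if the page load fails.
        browser.quit()


if __name__ == "__main__":
    # Placeholder URL; replace it with the page you want to scrape.
    print(fetch_title("https://example.com"))
```

With the `-headless` switch Firefox should not need an X display, so no XQuartz window is expected to appear. Clicking still works the usual way through the element's `click()` method once the page is loaded.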

1 Answer

Unfortunately, the short answer is that you almost certainly can't. Your HPC nodes cannot connect to the internet, and this is not what they were designed for in the first place.
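
Whether this applies to a particular cluster can be checked from a compute node before investing more effort. The snippet below is only a minimal sketch using the standard library; `can_reach` is a hypothetical helper and `example.com` is a placeholder for the site you plan to scrape.

```python
import socket


def can_reach(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Placeholder target; run this inside a job on a compute node,
    # not on the login node, since the two may be configured differently.
    print("outbound access:", can_reach("example.com"))
```

If this prints `False` from inside a batch job, the node most likely has no route to the outside and the scraping job will not be able to load pages there.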

nathan liang