I'm working on a scraping project and I need to fetch a page.

Here is the code I'm using to open the page:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("user-data-dir=selenium")
print("Opening browser")
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", options=chrome_options)
print("getting request")
driver.get("http://www.tsetmc.com/Loader.aspx?ParTree=15131F")
print("starting wait")
time.sleep(10)
response = driver.page_source
print("got response, quitting...")
driver.quit()

My problem

The problem is that the script hangs when it reaches driver.get(): it neither ends the process nor prints "starting wait". (The problem occurs on both my laptop and the server.) I have tried removing the --headless option, and then it works fine on my laptop (Ubuntu 20.04), but when I upload it to my server (Ubuntu Server 18.04) and run it there, Chrome crashes (exception message below).

Message: unknown error: Chrome failed to start: exited abnormally. (unknown error: DevToolsActivePort file doesn't exist) (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

So I have come to the conclusion that I have to use the --headless option, since there is no GUI on my server and Chrome crashes without one.

In short, I need help troubleshooting why driver.get() waits forever.

PS: I can run the code below with no problem, which seems weird to me:

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument("user-data-dir=selenium")
browser = webdriver.Chrome('/usr/lib/chromium-browser/chromedriver', options=chrome_options)
print("open browser")
browser.get("https://www.codal.ir")
print("get")
time.sleep(10)
response = browser.page_source
print("response")
browser.quit()
undetected Selenium
Arman Babaei
  • Does the code work from your own computer/laptop? Are you able to run curl http://www.tsetmc.com/Loader.aspx?ParTree=15131F on your server? – Meny Issakov Jul 19 '20 at 08:29
  • @MenyIssakov the problem of waiting is both on my laptop and server, and I can run the link on curl on the server – Arman Babaei Jul 19 '20 at 08:33
  • I'm not familiar with the internals, but it won't work for me unless I remove (chrome_options.add_argument("user-data-dir=selenium")) – Meny Issakov Jul 19 '20 at 08:42
  • @MenyIssakov Thanks, I hadn't noticed that. It seems many people have this problem when combining headless and user-data-dir. By the way, I added three new arguments: ```chrome_options.add_argument('--profile-directory=selenium') chrome_options.add_argument("--remote-debugging-port=9222") chrome_options.add_argument("--window-size=1400x1000")``` and now it waits on response = browser.page_source (not the 10 second sleep) – Arman Babaei Jul 19 '20 at 09:17

1 Answer

Assuming you have created a Chrome Profile named selenium, you need to add -- before the argument user-data-dir and pass the absolute path of the Chrome Profile Directory, as follows:

chrome_options.add_argument("--user-data-dir=/path/to/chromium-profile/selenium")

As an alternative, you can also use the argument --profile-directory as follows:

chrome_options.add_argument('--profile-directory=selenium')
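
To put the first suggestion in context, here is a minimal sketch of how the options from the question might look with the -- prefix and an absolute profile path. The helper names build_chrome_args and fetch_page_source, the profile_dir parameter, and the 30 second page-load timeout are illustrative assumptions and are not part of the question; the timeout is an extra added here so that a stuck driver.get() gives up instead of waiting indefinitely, and it is not claimed to fix the underlying cause.

```python
import os


def build_chrome_args(profile_dir, headless=True):
    """Return Chrome switches for a dedicated profile directory.

    profile_dir is a hypothetical placeholder; it may be relative and is
    converted to an absolute path, as the answer recommends.
    """
    args = []
    if headless:
        args.append("--headless")
    args.append("--no-sandbox")
    # Note the leading "--", which the question's original argument lacked
    args.append("--user-data-dir=" + os.path.abspath(profile_dir))
    return args


def fetch_page_source(url, driver_path, chrome_args, timeout=30):
    """Open url and return its page source, or None if loading times out."""
    # Imported here so build_chrome_args can be used without Selenium installed
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    for arg in chrome_args:
        chrome_options.add_argument(arg)
    # Same constructor style as in the question (driver path passed positionally)
    driver = webdriver.Chrome(driver_path, options=chrome_options)
    try:
        # Assumption: bound the wait so a hanging get() raises instead of blocking
        driver.set_page_load_timeout(timeout)
        driver.get(url)
        return driver.page_source
    except TimeoutException:
        return None
    finally:
        driver.quit()
```

With the paths from the question, this would be called as fetch_page_source("http://www.tsetmc.com/Loader.aspx?ParTree=15131F", "/usr/lib/chromium-browser/chromedriver", build_chrome_args("selenium")). A None result means the page did not finish loading within the timeout.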
undetected Selenium