I have a web crawler using Selenium and Chromium running on Ubuntu Linux 16.04. Each new crawling request comes in through Apache/WSGI, which creates a new Python thread for the request and spawns a Chromium process (via pyvirtualdisplay and Xvfb) to load the website, log in, take screenshots, etc.
I start Chromium with the flags --disable-extensions, --disable-gpu, --headless and --no-sandbox, and set the page load strategy to "none":
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = DesiredCapabilities.CHROME.copy()  # copy so the shared class dict isn't mutated
caps["pageLoadStrategy"] = "none"
I then have a function that checks every second whether the page has loaded yet (some pages never fully load within a reasonable time, so I wait until they're at least interactive before proceeding):
driver.execute_script("return document.readyState;")
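The checker is essentially a once-per-second poll, sketched here (wait_until_interactive is just an illustrative name):

import time
from selenium.common.exceptions import TimeoutException

def wait_until_interactive(driver, timeout=30):
    # Poll document.readyState once per second until the page is at
    # least "interactive", or give up after `timeout` seconds.
    for _ in range(timeout):
        state = driver.execute_script("return document.readyState;")
        if state in ("interactive", "complete"):
            return state
        time.sleep(1)
    raise TimeoutException("page never became interactive")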
The weird thing is that now, when I load a page, it immediately reports state 'complete' (and continues to for the next 15 seconds). But when I then try to find an element, it can't be found, so I don't think the page has actually loaded. Normally it reports 'loading', then 'interactive', and so on.
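For example, something like this (the id is just a placeholder) raises NoSuchElementException even while readyState reports 'complete':

from selenium.common.exceptions import NoSuchElementException

try:
    driver.find_element_by_id("login")  # placeholder id; any element on the page behaves the same
except NoSuchElementException:
    print("readyState says 'complete', but the element isn't there")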
I've tried restarting Apache, but that doesn't seem to have fixed anything. What could be wrong?
I can see in my process list that Chromium and Xvfb are indeed running when a new request comes in:
7429 ? S 0:00 Xvfb -br -nolisten tcp -screen 0 1024x768x24 :2165
7430 ? Sl 0:00 /var/www/html/flaskapp/chromedriver --port=39146
7438 ? Sl 0:00 /usr/lib/chromium-browser/chromium-browser --disable-background-networking --disable-client-side-phishing
7440 ? S 0:00 /usr/lib/chromium-browser/chromium-browser --type=zygote --no-sandbox --enable-logging --headless --log-l
7457 ? Sl 0:00 /usr/lib/chromium-browser/chromium-browser --type=gpu-process --no-sandbox --enable-logging --headless --
7468 ? S 0:00 /usr/sbin/apache2 -k start
7469 ? Sl 0:00 /usr/lib/chromium-browser/chromium-browser --type=renderer --no-sandbox --enable-automation --enable-logg