2

I can get the page source with browser--chrome's head on.

vim  get_with_head.py

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = Options()
browser = webdriver.Chrome(executable_path="/usr/bin/chromedriver",options=chrome_options)
browser.maximize_window()
wait = WebDriverWait(browser, 40)
url="https://www.nasdaq.com/market-activity/quotes/nasdaq-ndx-index"
browser.get(url)
wait.until(lambda e: e.execute_script('return document.readyState') != "loading")
print(browser.page_source)

It works fine.

python3  get_with_head.py

The chrome opens the webpage,all content in the webpage showns ,now i add three lines to make it a headless browser :

chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--headless")

The whole codes:

vim get_without_head.py

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(executable_path="/usr/bin/chromedriver",options=chrome_options)
browser.maximize_window()
wait = WebDriverWait(browser, 40)
url="https://www.nasdaq.com/market-activity/quotes/nasdaq-ndx-index"
browser.get(url)
wait.until(lambda e: e.execute_script('return document.readyState') != "loading")
print(browser.page_source)

It can't get the content on the webpage:

python3  get_without_head.py
<html><head>
<title>Access Denied</title>
</head><body>
<h1>Access Denied</h1>
 
You don't have permission to access "http://www.nasdaq.com/market-activity/quotes/nasdaq-ndx-index" on this server.<p>
Reference #18.4660dc17.1631258672.2c70b7e3


</p></body></html>

Why can get all content with browser's head on instead of in headless status ?

showkey
  • 482
  • 42
  • 140
  • 295
  • page_source is a best attempt at the page's source. It does NOT guarentee it will be a full DOM dump. – DMart Sep 10 '21 at 15:23

1 Answers1

5

Why?

Headless mode uses its own default User-Agent if it is not given as an argument. However some webpages may block Headless mode User-Agent to avoid unwanted traffic. It may result in Access denied error while trying to open a webpage.

An exemplary default User-Agent for headless mode:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/60.0.3112.50 Safari/537.36

As you see, it explicitly shows that browser is running on Headless mode.

Solution:

Change the User-Agent option.

windows_useragent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
linux_useragent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36"
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument("--headless")
user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36'
chrome_options.add_argument(f'user-agent={user_agent}')
browser = webdriver.Chrome(options=chrome_options)
browser.maximize_window()
wait = WebDriverWait(browser, 40)
url="https://www.nasdaq.com/market-activity/quotes/nasdaq-ndx-index"
browser.get(url)
wait.until(lambda e: e.execute_script('return document.readyState') != "loading")
print(browser.page_source)
Muhteva
  • 2,729
  • 2
  • 9
  • 21