
I am trying to scrape a JavaScript-rendered website with Selenium. When Beautiful Soup parses what Selenium retrieved, I get an HTML page that says: "Cookies must be enabled in order to view this page." If anyone could help me past this stumbling block I would appreciate it. Here is my code:

# import libraries and specify URL
import lxml
import pandas as pd
from bs4 import BeautifulSoup
import html5lib
from selenium import webdriver
import urllib.request
import csv

url = "https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2020/06/09"

#new chrome session
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--incognito")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(executable_path='/Users/susanwhite/PycharmProjects/Horse Racing/chromedriver', chrome_options=chrome_options)



# Implicitly wait up to 10s when locating elements (this does not wait for the full page load)
driver.implicitly_wait(time_to_wait=10)

# Load the web page
driver.get(url)
cookies = driver.get_cookies()




# Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'html5lib')
print(soup)

3 Answers


Try removing this line: chrome_options.add_argument("--incognito"). There's no need for it, as Selenium starts with a fresh profile and doesn't save cookies or any other information from websites between sessions anyway.

  • Thank you for getting back to me. The incognito mode was just me trying different options to see if I could get something to work. I tried both with and without and got the same results. – Spiker Mar 27 '22 at 15:57
  • 1
    No problem. I gave another answer to a problem that looks like yours. If you wanna check it: https://stackoverflow.com/questions/71634382/how-to-accept-span-button-using-selenium/71635019#71635019 – hymn0 Mar 27 '22 at 19:01
  • I used some of the options that you had in the above answer, and I got it to work. Once. After that it gives me a "SessionNotCreatedException: Message: session not created" error – Spiker Mar 28 '22 at 01:10
  • Try using a newer version of Chrome on the user agent, or checking this answer: https://stackoverflow.com/questions/60296873/sessionnotcreatedexception-message-session-not-created-this-version-of-chrome – hymn0 Mar 28 '22 at 02:28

Removing the line below solved it for me, but headless mode will be disabled and the browser window will be visible.

chrome_options.add_argument("--headless")


Your issue might also be with the specific website you're accessing. I had the same problem, and after poking around with it, it looks like something in the way the HKJC website loads makes Selenium think the page has finished loading prematurely. I was able to get good page_source output by putting a time.sleep(30) after the get call, so my code looks like:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import time

options = Options()
options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'C:\Python\webdrivers\geckodriver.exe')
driver.get("https://racing.hkjc.com/racing/information/English/Racing/LocalResults.aspx?RaceDate=2023/01/01&RaceNo=1")
time.sleep(30)
html = driver.page_source
with open('Date_2023-01-01_Race1.html', 'wb') as f:
    f.write(html.encode('utf-8'))

You might not have to sleep that long. I found manually loading the pages takes 20+ seconds for me because I have slow internet over VPNs. It also works headless for me as above.
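If you want to avoid always paying the full sleep, the usual alternative is to poll for a condition with a timeout, which is the pattern Selenium's WebDriverWait implements. A plain-Python sketch of the idea (the wait_for helper here is hypothetical, just to illustrate the pattern; with Selenium you'd pass something like a lambda that checks whether the results table markup has appeared in driver.page_source):

```python
import time

def wait_for(condition, timeout=30, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll)
```

With the HKJC page, the condition could check that the race-results markup is actually present before grabbing page_source, so fast loads return immediately instead of always waiting the full 30 seconds.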

You do have to make sure the Firefox geckodriver is the latest version (at least according to other posts; I only tried this over ~2 days, not long enough for my installed Firefox and geckodriver to get out of sync).
