I am writing a Python script to monitor a website for changes. The aim is to receive a notification once an element on the page is updated (e.g. a button that did not exist before appears). I don't need to log in to an account on the website. Because I don't have much web development knowledge, I found some code and modified it to meet my needs. Basically, it looks like this:
import time
import datetime
import random
from selenium import webdriver
from selenium.common.exceptions import (NoSuchElementException,
                                        WebDriverException)
from selenium.webdriver.common.by import By
from fake_useragent import UserAgent
from selenium.webdriver.support.wait import WebDriverWait

# Candidate viewport sizes (width, height); one is picked at random per session.
screen_dims = [(375, 667), (411, 731), (360, 640), (414, 736), (375, 812),
               (768, 1024), (1024, 1366), (540, 720)]


def main():
    while True:
        # Start a fresh browser with a random user agent on every iteration.
        ua = UserAgent()
        user_agent = ua.random
        options = webdriver.ChromeOptions()
        options.add_experimental_option("excludeSwitches",
                                        ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)
        options.add_argument('disable-infobars')
        options.add_argument(f'user-agent={user_agent}')
        driver = webdriver.Chrome(options=options)
        set_viewport_size(driver)
        driver.get(a_url_to_the_page_of_interest)
        available = check_availability(driver)
        if available:
            print("Found")
            break
        # Not available yet: close the browser and retry after a pause.
        driver.quit()
        time.sleep(10)


def set_viewport_size(driver):
    width, height = random.choice(screen_dims)
    # Add the browser chrome (borders, toolbars) to the desired viewport size.
    window_size = driver.execute_script(
        """
        return [window.outerWidth - window.innerWidth + arguments[0],
                window.outerHeight - window.innerHeight + arguments[1]];
        """, width, height)
    driver.set_window_size(*window_size)


def check_availability(driver):
    # Dismiss the privacy banner if it is present and clickable.
    try:
        driver.find_element(By.ID, "privacy-button-id").click()
    except WebDriverException:
        pass
    # The target element only exists once the page has been updated.
    try:
        driver.find_element(By.ID, "some-other-button")
        return True
    except NoSuchElementException:
        return False
The problem is that after the 3rd or 4th iteration of the loop in main(), the website I monitor redirects me to a captcha page (due to frequent refreshing, I guess).
I tried several methods that I could find, such as a fake user agent, different viewport sizes, and a longer refresh interval (waiting 10 s between refreshes), but none of them worked.
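For reference, the fixed 10-second wait could also be replaced by a randomized, growing delay. This is only a minimal sketch, and I have not confirmed that it helps against the captcha; `jittered_delay` and its parameters (`base`, `factor`, `cap`) are hypothetical names of my own, not part of Selenium.

```python
import random


def jittered_delay(attempt, base=10.0, factor=1.5, cap=300.0, rng=random):
    """Return a wait time in seconds for the given 0-based attempt.

    The delay grows geometrically from `base`, is capped at `cap`, and is
    then scaled by a random factor in [0.5, 1.5] so that requests are not
    evenly spaced. All defaults are assumptions, not tuned values.
    """
    delay = min(cap, base * (factor ** attempt))
    return delay * rng.uniform(0.5, 1.5)
```

In the loop above, this would be used as `time.sleep(jittered_delay(attempt))` in place of `time.sleep(10)`, with `attempt` counting the iterations.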
Some of the Stack Overflow posts I read and tried: this, this, and this.
I don't want to interact with the captcha directly; I just want to avoid it. The only thing I can think of is sending each request from a different IP. However, (1) I don't know whether this would help, and (2) if it would, how can I implement it?
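To make the idea concrete, this is roughly what I have in mind: cycle through a list of proxies and give Chrome a different one on each iteration via its `--proxy-server` command-line switch. It is an untested sketch; the addresses in `PROXIES` are placeholders, and `proxy_arguments` is a hypothetical helper of my own.

```python
import itertools

# Placeholder proxies (documentation-range addresses); replace with real ones.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]


def proxy_arguments(proxies):
    """Yield Chrome command-line arguments, cycling through the proxies forever."""
    for proxy in itertools.cycle(proxies):
        yield f"--proxy-server=http://{proxy}"


# One shared generator, so each loop iteration picks the next proxy.
proxy_args = proxy_arguments(PROXIES)
```

Inside the loop in main(), the call would be `options.add_argument(next(proxy_args))` before the driver is created.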
Are there any other choices?
Thank you for your help!