Thank you all in advance for your patience and kindness. I am new to Docker and am having trouble using it for my task. If I have posted anything incorrectly, please let me know rather than downvoting.
I am working on a crawler project and trying to combine Docker, Selenium, and multiple proxies. My ideal workflow is: one machine hosts multiple containers, one spider (Selenium) script runs in each container, and each script completes the "fetch and render website -> extract information -> update local database" process.
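To make the one-script-per-container idea concrete, here is a minimal sketch of how each container's script might pick up its own proxy. It assumes a hypothetical environment variable named SPIDER_PROXY (set when the container starts, for example with `docker run -e SPIDER_PROXY=host:port ...`); the variable name, the helper names, and the example address are illustrative, not part of my actual project.

```python
import os


def proxy_from_env(env=os.environ, var="SPIDER_PROXY"):
    """Return the proxy (host:port) assigned to this container, or None.

    SPIDER_PROXY is a hypothetical variable name chosen for this sketch.
    """
    value = env.get(var, "").strip()
    return value or None


def chrome_args(proxy):
    """Build the Chrome command-line switches for this container's browser."""
    args = ["--headless"]
    if proxy:
        # Chrome's --proxy-server switch takes the address of the proxy itself.
        args.append(f"--proxy-server=http://{proxy}")
    return args


if __name__ == "__main__":
    # Each container would print (or use) a different list, depending on
    # the value it was started with.
    print(chrome_args(proxy_from_env()))
```

With this layout the script is identical in every container, and only the environment differs between them.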
I know how to set a proxy for Selenium locally. Here is part of my script:
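    import threading

    from selenium import webdriver

    # One driver per thread, created lazily on first use.
    # (Assumed definition; threadLocal is set up elsewhere in the script.)
    threadLocal = threading.local()

    def get_driver():
        driver = getattr(threadLocal, 'driver', None)
        if driver is None:
            chromeOptions = webdriver.ChromeOptions()
            chromeOptions.add_argument("--headless")
            # The proxy is fixed at the moment the browser starts.
            chromeOptions.add_argument('--proxy-server=http://pubproxy.com/api/proxy?format=txt')
            prefs = {
                "profile.managed_default_content_settings.images": 2,
                'disk-cache-size': 4096,
                'permissions.default.stylesheet': 2
            }
            chromeOptions.add_experimental_option('prefs', prefs)
            # `options=` replaces the deprecated `chrome_options=` keyword.
            driver = webdriver.Chrome(options=chromeOptions)
            setattr(threadLocal, "driver", driver)
        return driver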
With a proxy, I can complete the "fetch and render website -> extract information -> update local database" process locally, but I am not sure whether I can deploy it across multiple Docker containers.
I checked this post: How to configure special proxy settings for a remote selenium webdrive with python?. However, I am still unsure whether I should change the proxy in the Docker settings instead, because I saw an example like this:
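    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    # Connects to a Selenium server listening on port 4444 (e.g. in a container).
    chrome = webdriver.Remote(
        command_executor='http://localhost:4444/wd/hub',
        desired_capabilities=DesiredCapabilities.CHROME
    )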
If I change the IP in the Docker settings, does that mean I need to stop and restart the container repeatedly? (That does not seem ideal for a spider.)
Can I pass configuration parameters to the Selenium Chrome webdriver running in Docker, or do I need to build the container image with the proxy settings preconfigured?
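To illustrate what I mean by passing configuration parameters, here is a rough sketch of choosing the proxy per browser session rather than per container. It assumes a Selenium server is already running in a container and reachable at http://localhost:4444/wd/hub, and that `webdriver.Remote` accepts an `options=` argument as it does for local Chrome; the proxy addresses and helper names are placeholders I made up for the example.

```python
import itertools

# Hypothetical proxy pool (documentation-range addresses, not real proxies).
PROXIES = ["192.0.2.10:3128", "192.0.2.11:3128"]


def build_chrome_args(proxy):
    """Return the Chrome command-line switches for one browser session."""
    return ["--headless", f"--proxy-server=http://{proxy}"]


def new_remote_driver(proxy, hub_url="http://localhost:4444/wd/hub"):
    """Open a new session on the Selenium container, routed through `proxy`."""
    # Imported here so build_chrome_args can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(proxy):
        options.add_argument(arg)
    # The options are sent along with the new-session request, so the
    # proxy is chosen per session and not baked into the container.
    return webdriver.Remote(command_executor=hub_url, options=options)


# Round-robin over the pool: each new session takes the next proxy.
proxy_cycle = itertools.cycle(PROXIES)
```

Usage would then look like `driver = new_remote_driver(next(proxy_cycle))`, followed by `driver.quit()` before opening the next session with a different proxy. If this works the way I hope, only the browser session would need to be restarted, not the container.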
Any hints on achieving or optimizing this workflow would be appreciated. I am following the official Docker tutorial and got lost. I sincerely hope someone can point me in the right direction.