Thank you all in advance for your patience and kindness. I am new to Docker and am having trouble using it for my task. If I have posted anything incorrectly, please let me know rather than downvoting.

I am working on a crawler project and trying to combine Docker, Selenium, and multiple proxies. My ideal workflow is this: one machine hosts multiple containers, one spider (Selenium) script runs in each container, and each script completes the "fetch and render website -> extract information -> update local database" process.

I know how to change the proxy in Selenium locally. Here is part of my script:

import threading

from selenium import webdriver

# one driver per thread, created lazily on first use
threadLocal = threading.local()

def get_driver():
    driver = getattr(threadLocal, 'driver', None)
    if driver is None:
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        chromeOptions.add_argument('--proxy-server=http://pubproxy.com/api/proxy?format=txt')
        prefs = {
            "profile.managed_default_content_settings.images": 2,
            'disk-cache-size': 4096,
            'permissions.default.stylesheet': 2
        }
        chromeOptions.add_experimental_option('prefs', prefs)
        # 'options' replaces the deprecated 'chrome_options' keyword
        driver = webdriver.Chrome(options=chromeOptions)
        setattr(threadLocal, "driver", driver)
    return driver

I can complete the "fetch and render website -> extract information -> update local database" process through a proxy locally, but I am not sure whether I can deploy it across multiple Docker containers.

I checked this post: How to configure special proxy settings for a remote selenium webdrive with python?. But I am still unsure whether I should change the proxy in the Docker settings, because I saw an example like this:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chrome = webdriver.Remote(command_executor='http://localhost:4444/wd/hub', desired_capabilities=DesiredCapabilities.CHROME)

If I change the IP in the Docker settings, does that mean I need to stop and restart the container repeatedly? (That would not be ideal for a spider.)

Can I pass configuration parameters to the Chrome Selenium WebDriver running in Docker, or do I need to build the container image with the proxy settings preconfigured?

Any hints on achieving or optimizing this workflow would be appreciated. I have been following the official Docker tutorial and got lost. I sincerely hope someone can point me in the right direction.

Vanderwood

1 Answer

You can pass the proxy to the function as an argument, e.g. get_driver(proxy).
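
A minimal sketch of that idea, under some assumptions: the hub address `http://localhost:4444/wd/hub` is taken from the question, the helper name `proxy_argument` is made up for illustration, and the installed Selenium release is assumed to accept an `options` argument in `webdriver.Remote`. The thread-local caching from the question's script is left out to keep the example short.

```python
def proxy_argument(proxy):
    """Build the Chrome command-line switch for a proxy such as 'http://1.2.3.4:8080'."""
    return "--proxy-server=" + proxy


def get_driver(proxy, hub_url="http://localhost:4444/wd/hub"):
    """Start a remote headless Chrome session that sends its traffic through `proxy`."""
    # Imported here so proxy_argument() can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument(proxy_argument(proxy))
    # Assumption: this Selenium version lets Remote() take `options`.
    return webdriver.Remote(command_executor=hub_url, options=options)
```

Because the proxy switch travels with the session request, each new session can be given a different proxy, so the container itself should not need to be rebuilt or restarted when the proxy changes.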

You can write a simple function that picks a proxy from a list stored in a file on a volume shared with your container; you can then edit that file at any time.

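import random

def get_driver_with_random_proxy():
    # proxy_file.txt lives on the shared volume, one proxy per line
    with open('proxy_file.txt') as file:
        # skip blank lines, e.g. the empty string left by a trailing newline
        proxies_list = [line.strip() for line in file if line.strip()]
    random_proxy = random.choice(proxies_list)
    return get_driver(random_proxy)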
Alex Triphonov
  • Thank you so much. This is helpful. I have a few more questions. It seems I can run multiple containers on different ports, and they can all get their proxies from one shared file or database. Is that correct? Or should I just use 4444? I am still confused about this part. – Vanderwood Oct 18 '19 at 21:56
  • @Serena you can use any ports you like. You will probably also want a second file listing the proxies already in use, to make sure each container uses its own proxy. – Alex Triphonov Dec 17 '19 at 15:22
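
The idea from the last comment, a second file that records which proxies are already taken, might be sketched as follows. This is only an illustration: the function name `claim_unused_proxy` and the file layout (one proxy per line) are assumptions, and the locking relies on `fcntl.flock`, which is available on Linux and is assumed to be honoured on the shared volume.

```python
import fcntl
from pathlib import Path


def claim_unused_proxy(proxy_file, used_file):
    """Return a proxy from proxy_file that is not yet listed in used_file.

    The used file is locked while it is read and updated, so two containers
    sharing the volume should not claim the same proxy.
    Returns None when every proxy is already taken.
    """
    lines = Path(proxy_file).read_text().splitlines()
    proxies = [line.strip() for line in lines if line.strip()]

    # "a+" creates the file if it is missing and always writes at the end.
    with open(used_file, "a+") as used:
        fcntl.flock(used, fcntl.LOCK_EX)  # waits until no one else holds the lock
        try:
            used.seek(0)
            taken = {line.strip() for line in used.read().splitlines() if line.strip()}
            for proxy in proxies:
                if proxy not in taken:
                    used.write(proxy + "\n")
                    used.flush()
                    return proxy
            return None
        finally:
            fcntl.flock(used, fcntl.LOCK_UN)
```

Each container would call this once at startup and pass the result to get_driver(proxy). Releasing a proxy when a container stops is not handled in this sketch.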