I am using multiprocessing via `from multiprocessing import Pool`.

I currently use requests, but I want to switch to Selenium because some data is loaded in a pop-up. What is the best way to use PhantomJS without running into memory leaks?

Volatil3
  • Set up a Selenium Grid with `maxInstances` set to something each node can handle; that way you can add nodes as needed. How many instances are you looking for? How many requests per minute? If that is not an option, perhaps consider reusing the Selenium sessions and rotating through them while they make requests. – jmunsch Mar 20 '17 at 08:03
  • @jmunsch I did not know about `Selenium Grid`. Since I want to use parallel processing, I need about 5 instances at a time. Each request would have a 2-5 second delay. – Volatil3 Mar 20 '17 at 08:45
  • @jmunsch Second, I need a server-based solution; this grid seems to require installing Java. – Volatil3 Mar 20 '17 at 08:53
  • What do you mean by "a server based solution"? Is it for unit testing, or for something like crawling/scraping? If it's for unit testing, consider using xvfb with pyvirtualdisplay. If it's for scraping/crawling, continue with what you are doing, but consider putting it inside a Docker container for memory reasons and adding a REST interface in front of it, so that you can handle memory leaks by "rebooting" the container. That way you can also scale horizontally with a load balancer that points to all the containers and their REST interfaces. – jmunsch Mar 20 '17 at 17:51
  • @jmunsch you have given some strong suggestions. Can you point me to some resources to learn more about such a setup? – Volatil3 Mar 21 '17 at 17:49
  • https://github.com/wernight/docker-phantomjs AND http://stackoverflow.com/questions/30323224/deploying-a-minimal-flask-app-in-docker-server-connection-issues AND https://docs.docker.com/compose/django/ these are all pieces to what I was imagining. `xvfb-run` allows headless browsing with chrome/firefox/opera etc: http://manpages.ubuntu.com/manpages/trusty/man1/xvfb-run.1.html – jmunsch Mar 21 '17 at 23:49

1 Answer

The basic idea roughly translated might look like this:

from __future__ import unicode_literals
import logging
from werkzeug.routing import Map, Rule
from werkzeug.exceptions import HTTPException
from werkzeug.wrappers import Request, Response


class WebApp(object):

    def __init__(self, **kw):
        self.log = logging.getLogger(__name__)

    def __call__(self, environ, start_response):
        return self.wsgi_app(environ, start_response)

    def wsgi_app(self, environ, start_response):
        request = Request(environ)
        response = self.dispatch_request(request)
        return response(environ, start_response)

    def dispatch_request(self, request):
        adapter = self.url_map.bind_to_environ(request.environ)
        try:
            endpoint, values = adapter.match()
            method = getattr(self, 'endpoint_{}'.format(endpoint))
            return method(adapter, request, **values)
        except HTTPException as e:
            # HTTP errors (404, 405, ...) are valid WSGI responses themselves.
            return e

    url_map = Map([])


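from time import sleep
from subprocess import Popen, PIPE

from pyvirtualdisplay import Display
from selenium.webdriver import DesiredCapabilities, Firefox, FirefoxProfile
from selenium.webdriver.common.proxy import Proxy, ProxyType

# Start a virtual X display so Firefox can run on a server without a screen.
display = Display(visible=0, size=(800, 600))
display.start()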

def get_capabilities():
    # Placeholder address; replace with a real proxy host:port.
    proxy = '123.456.789.012'

    proxyobj = Proxy({
        'proxyType': ProxyType.MANUAL,
        'httpProxy': proxy,
        'ftpProxy': proxy,
        'sslProxy': proxy,
        'noProxy': ''  # set this value as desired
    })
    # Copy so the shared DesiredCapabilities.FIREFOX dict is not mutated.
    capabilities = DesiredCapabilities.FIREFOX.copy()
    capabilities['acceptSslCerts'] = True
    proxyobj.add_to_capabilities(capabilities)
    return capabilities





# Pool of pre-started browsers that requests borrow and hand back.
drivers = [
    Firefox(FirefoxProfile('/etc/firefox/u2vgyy61.Proxied_User/'),
            capabilities=get_capabilities())
    for _ in range(3)
]

class Routes(WebApp):
    def endpoint_get_response(self, adapter, request, **values):
        url = request.values.get("query_param_here", "")
        if not url:
            return Response("Missing query_param_here", status=400)
        # something better here
        while True:
            try:
                driver = drivers.pop()
            except IndexError:
                # Every driver is busy; wait and try again.
                sleep(1)
                continue
            try:
                # driver.get() returns None; the HTML is in page_source.
                driver.get(url)
                # response_text = Popen(['docker', "exec", "-it", "selenium_phantom", url]).communicate()[0]
                return Response(driver.page_source)
            finally:
                # Hand the driver back even if the page load failed.
                drivers.append(driver)

    url_map = Map([
        Rule('/get_response', endpoint='get_response', methods=['GET']),
    ])
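
The "something better here" comment could be filled in with a blocking queue instead of a list and a sleep loop. The following is only a minimal sketch under assumptions: `DriverPool`, `borrow`, `fetch` and `FakeDriver` are hypothetical names that are not part of the code above, it targets Python 3, and `FakeDriver` stands in for a real Selenium driver so the example runs without a browser.

```python
import queue
from contextlib import contextmanager


class DriverPool(object):
    """Hand out pre-started drivers one at a time and take them back."""

    def __init__(self, drivers):
        self._idle = queue.Queue()
        for driver in drivers:
            self._idle.put(driver)

    @contextmanager
    def borrow(self, timeout=30):
        # Block until a driver is free; raises queue.Empty after `timeout` seconds.
        driver = self._idle.get(timeout=timeout)
        try:
            yield driver
        finally:
            # Always return the driver, even if the page load raised.
            self._idle.put(driver)


class FakeDriver(object):
    """Hypothetical stand-in for a Selenium driver (no browser needed)."""

    def __init__(self, name):
        self.name = name
        self.page_source = ""

    def get(self, url):
        self.page_source = "<html>{} via {}</html>".format(url, self.name)


def fetch(pool, url):
    """Load `url` with whichever driver is free and return its HTML."""
    with pool.borrow() as driver:
        driver.get(url)
        return driver.page_source


pool = DriverPool([FakeDriver("a"), FakeDriver("b")])
html = fetch(pool, "http://stackoverflow.com")
```

With real drivers, the list passed to `DriverPool` would hold the Firefox instances, and the endpoint would call `fetch(pool, url)`. `queue.Queue` is thread-safe, so concurrent requests never receive the same driver.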

Example usage:

curl http://node1/get_response?query_param_here=http://stackoverflow.com
curl http://node2/get_response?query_param_here=http://stackoverflow.com
curl http://node3/get_response?query_param_here=http://stackoverflow.com
curl http://node4/get_response?query_param_here=http://stackoverflow.com
...
and so on

with a load balancer in front, like:
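
The round-robin idea behind such a load balancer can be pictured on the client side. This is only an illustrative sketch, not the setup the answer refers to: the node names come from the curl examples, and `build_url` and `next_url` are hypothetical helper names.

```python
from itertools import cycle
from urllib.parse import urlencode

# Node names taken from the curl examples; adjust to the real hosts.
NODES = ["node1", "node2", "node3", "node4"]
_rotation = cycle(NODES)


def build_url(node, target):
    """Build the /get_response URL for one node, escaping the target URL."""
    query = urlencode({"query_param_here": target})
    return "http://{}/get_response?{}".format(node, query)


def next_url(target):
    """Pick the next node in round-robin order and return the request URL."""
    return build_url(next(_rotation), target)


# Five requests wrap around to node1 after node4.
urls = [next_url("http://stackoverflow.com") for _ in range(5)]
```

Each URL in `urls` could then be passed to `requests.get`. A real load balancer would do the same rotation on the server side and could also skip nodes that are restarting.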

jmunsch