I have code that scrapes the friend list for a given Facebook UID. It works, but scraping a whole list takes a long time, so I want to speed it up with multiprocessing and Selenium Grid. This is my approach:

  1. Log in to Facebook with an account.
  2. Open 5 Firefox instances that share the same cache and cookies (so I don't need to log in again).
  3. Scrape the friend lists of 5 different UIDs simultaneously, one UID per instance.

This is my code, but it doesn't work:

import multiprocessing
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
from selenium import webdriver

def friend_uid_list(uid, driver):
    driver.get('https://www.facebook.com/' + uid + '/friends')
    # scrape friend list
    target.close()

def g(arg):
    return friend_uid_list(*arg)

if __name__ == '__main__':

    driver = webdriver.Firefox()
    driver.get("https://www.facebook.com/")
    driver.find_element_by_css_selector("#email").send_keys("email@gmail.com")
    driver.find_element_by_css_selector("#pass").send_keys("password")
    driver.find_element_by_css_selector("#u_0_m").click()

    pool = multiprocessing.Pool(5)
    pool.map(g, [(100004159542140,driver),(100004159542140,driver),(100004159542140,driver)])

So, can you show me how to use Selenium Grid to run multiple instances simultaneously? I searched a lot but don't know how to apply it to my code. Thank you :)

jmunsch
NGuyen

1 Answer

Here is another approach that does not use Selenium Grid.

It opens 5 Firefox instances with 3 windows each. The cookies are copied over from the main, logged-in instance.

from pyvirtualdisplay import Display
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import multiprocessing

# Optional: run headless through Xvfb (not available on Windows); this part can be left out
display = Display(visible=0, size=(800, 600))
display.start()

def friend_uid_list(uid, driver):
    # visit every open window of this driver and collect the scraped elements
    values = []
    for handle in driver.window_handles:
        driver.switch_to.window(handle)
        # driver.wait_for_element() etc etc
        values.append(driver.find_element_by_id('something'))  # an id takes no '#' prefix
        # scrape elements
    return values

def g(arg):
    # unpack a (uid, driver) tuple so the function can be used with pool.map
    return friend_uid_list(*arg)

Start an instance and log in:

d = webdriver.Firefox()
d.get("https://www.facebook.com/")
d.find_element_by_css_selector("#email").send_keys("email@gmail.com")
d.find_element_by_css_selector("#pass").send_keys("password")
d.find_element_by_css_selector("#loginbutton").click()

Start multiple instances:

drivers = [webdriver.Firefox(), webdriver.Firefox(), webdriver.Firefox(), webdriver.Firefox()]

Read the localStorage from the logged-in instance:

localstorage_kv = d.execute_script("var obj={};for (var i=0,len=localStorage.length;i<len;++i){obj[localStorage.key(i)]=localStorage.getItem(localStorage.key(i));};return obj")

Copy the cookies and localStorage into each new instance:

for e in drivers:
    e.get("https://www.facebook.com/")
    for x in d.get_cookies():
        e.add_cookie(x)
    for k, v in localstorage_kv.items():
        e.execute_script('localStorage.setItem(arguments[0], arguments[1])', k, v)
    e.refresh() # should be logged in now
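
Values read from localStorage are arbitrary strings, so a script assembled by string formatting needs them quoted as JavaScript literals. A minimal sketch of one way to do that, assuming a hypothetical helper named `set_item_script` (not part of Selenium) and relying on `json.dumps` with its default ASCII-only output to produce the quoted, escaped literals:

```python
import json


def set_item_script(key, value):
    """Return a JavaScript statement that stores one localStorage entry.

    json.dumps turns each Python string into a double-quoted, escaped
    literal, which is also a valid JavaScript string literal.
    """
    return "localStorage.setItem({}, {});".format(
        json.dumps(key), json.dumps(value)
    )


# Hypothetical usage with a driver `e` and the dict `localstorage_kv`:
#     for k, v in localstorage_kv.items():
#         e.execute_script(set_item_script(k, v))
```

Passing the key and value as extra arguments to `execute_script` and reading them as `arguments[0]` and `arguments[1]` avoids the quoting question entirely; the helper is only useful when the script text itself has to be built.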

Add the initial driver to the drivers list:

drivers.append(d)

And then loop over the uids:

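from multiprocessing.pool import ThreadPool

uids = [100004159542140, 100004159542140, 100004159542140, 100004159542140, 100004159542140, 100004159542140]
# threads instead of processes: a live driver generally cannot be pickled for a process pool
pool = ThreadPool(5)

def friends_url(uid):
    return "https://www.facebook.com/" + str(uid) + "/friends"

while uids:
    for driver in drivers:
        if len(driver.window_handles) == 1:
            # first pass: open up to two extra windows on friends pages
            for _ in range(min(2, len(uids))):
                driver.execute_script('window.open("' + friends_url(uids.pop()) + '")')
        else:
            # later passes: reuse the open windows for the next uids
            for handle in driver.window_handles[:len(uids)]:
                driver.switch_to.window(handle)
                driver.get(friends_url(uids.pop()))
    # friend_uid_list ignores its uid argument, so None is passed as a placeholder
    return_values = pool.map(g, [(None, driver) for driver in drivers])
    import pdb;pdb.set_trace()  # pause to inspect return_values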
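
The loop over the uids can also be written so that each driver gets its own share of the uids up front. The sketch below is one possible arrangement, not the code above: the names `split_round_robin`, `scrape_batch`, `scrape_all` and the `scrape(driver, uid)` callback are all hypothetical, and it assumes a thread pool is acceptable because each worker mostly waits on its browser.

```python
from multiprocessing.pool import ThreadPool


def split_round_robin(items, n):
    """Split items into n lists, dealing them out one at a time."""
    return [items[i::n] for i in range(n)]


def scrape_batch(args):
    """Run scrape(driver, uid) for every uid assigned to one driver."""
    driver, uids, scrape = args
    return [scrape(driver, uid) for uid in uids]


def scrape_all(drivers, uids, scrape):
    """Scrape all uids with one thread per driver.

    Returns a list of (uid, result) pairs, grouped by driver.
    """
    batches = split_round_robin(list(uids), len(drivers))
    jobs = [(d, batch, scrape) for d, batch in zip(drivers, batches)]
    pool = ThreadPool(len(drivers))
    try:
        # pool.map keeps the results in the same order as the jobs
        results = pool.map(scrape_batch, jobs)
    finally:
        pool.close()
        pool.join()
    return [
        (uid, result)
        for batch, batch_results in zip(batches, results)
        for uid, result in zip(batch, batch_results)
    ]
```

Each driver is only ever touched by the one thread that owns its batch, so no two threads drive the same browser at once. In real use `scrape` would call `driver.get(...)` on the friends page and return the scraped values.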

If you really want to share the cookies across nodes on a selenium grid look:

That roughly means: pickle the localStorage and cookies, transfer them to each node, and then load them into each instance on each node.
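
A minimal sketch of that idea, assuming the cookies come from `get_cookies()` as a list of dicts and the localStorage as a plain dict. The function names `dump_session`, `load_session` and `apply_session` are hypothetical, and how the bytes reach each node (file share, socket, etc.) is left open:

```python
import pickle


def dump_session(cookies, local_storage):
    """Serialize cookies (list of dicts) and localStorage (dict) to bytes."""
    return pickle.dumps({"cookies": cookies, "local_storage": local_storage})


def load_session(blob):
    """Inverse of dump_session; only unpickle data from a trusted source."""
    return pickle.loads(blob)


def apply_session(driver, session):
    """Load a session into a driver that is already on the target domain."""
    for cookie in session["cookies"]:
        driver.add_cookie(cookie)
    for key, value in session["local_storage"].items():
        # pass key and value as script arguments so no quoting is needed
        driver.execute_script(
            "localStorage.setItem(arguments[0], arguments[1]);", key, value
        )
    driver.refresh()
```

On the logged-in instance this would be `blob = dump_session(d.get_cookies(), localstorage_kv)`; on each node, navigate to the site first and then call `apply_session(driver, load_session(blob))`, since cookies can only be set for the current domain.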

jmunsch
  • hello, it shows an error when I run the code "ImportError: No module named 'selenium.common.keys' – NGuyen Aug 07 '16 at 10:04
  • @NGuyen will update now, should be `selenium.webdriver.common.keys` – jmunsch Aug 07 '16 at 10:05
  • Thank you but I got another error: easyprocess.EasyProcessCheckInstalledError: cmd=['Xvfb', '-help'] OSError=[WinError 2] The system cannot find the file specified Program install error! – NGuyen Aug 07 '16 at 10:16
  • I am using win10 64bit – NGuyen Aug 07 '16 at 10:16
  • It's not necessary to use `xvfb`, but if you want to run it headless it is a decent way to do it. Look into `xvfbwrapper`. – jmunsch Aug 07 '16 at 10:24
  • Hello, it worked after I remove xvfb. However, I got another error: selenium.common.exceptions.WebDriverException: Message: You may only set cookies for the current domain – NGuyen Aug 07 '16 at 10:42
  • @NGuyen You're right, I missed putting in any lines to navigate to facebook, and refresh the browser. I added those back in. – jmunsch Aug 07 '16 at 10:46
  • @NGuyen some basic steps to setup selenium grid: https://www.youtube.com/watch?v=9b5lGfBKzj0 – jmunsch Aug 07 '16 at 10:51
  • Hi, I got another error: "for k, v in localstorage_kv: ValueError: too many values to unpack (expected 2)" – NGuyen Aug 07 '16 at 11:17