
How can I execute the below script using multiple browsers?

Every n URLs should be processed by a separate browser, and I should be able to define the value of n (parallel scraping).

import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver

browser = webdriver.Chrome()

class GameData:

    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []

def parse_data(url):
    while True:
        try:
            browser.get(url)
            df = pd.read_html(browser.page_source)[0]
            break
        except KeyError:
            browser.quit()
            continue
    html = browser.page_source
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main'}, {'id': 'tournamentTable'})
    main = content.find('th', {'class': 'first2 tl'})
    if main is None:
        return None
    count = main.findAll('a')
    country = count[1].text
    league = count[2].text
    game_data = GameData()
    game_date = None
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            game_date = row[1].split('-')[0]
            continue
        game_data.date.append(game_date)
        game_data.time.append(row[1])
        game_data.game.append(row[2])
        game_data.score.append(row[3])
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])
        game_data.country.append(country)
        game_data.league.append(league)
    return game_data

# URLs go here
urls = {
    "https://www.oddsportal.com/soccer/world/international-champions-cup/results/#/",
    "https://www.oddsportal.com/soccer/romania/superliga-women/results/#/",
    "https://www.oddsportal.com/soccer/portugal/league-cup/results/#/",
    "https://www.oddsportal.com/soccer/world/valentin-granatkin-memorial/results/#/",
    "https://www.oddsportal.com/soccer/slovenia/prva-liga/results/#/",
    "https://www.oddsportal.com/soccer/brazil/campeonato-pernambucano/results/#/",
    "https://www.oddsportal.com/soccer/netherlands/eredivisie-cup-women/results/#/",
    "https://www.oddsportal.com/soccer/singapore/premier-league/results/#/",
    "https://www.oddsportal.com/soccer/world/world-cup-women-u20/results/#/",
    "https://www.oddsportal.com/soccer/world/premier-league-asia-trophy/results/#/",
}

if __name__ == '__main__':

    results = None

    for url in urls:
        game_data = parse_data(url)
        if game_data is None:
            continue
        result = pd.DataFrame(game_data.__dict__)
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)

    print(results)

Currently the script uses one browser window for all URLs.

How can I modify the code to open multiple browser instances, one for every n URLs, to do the same job faster and then append everything into results?

  • I would consider the `concurrent.futures` module. https://docs.python.org/3/library/concurrent.futures.html – Kota Mori Jul 12 '21 at 02:01
  • you will have to run every browser in a separate `thread`, or maybe better in a separate `process` - [threading](https://docs.python.org/3/library/threading.html), [multiprocessing](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing) – furas Jul 12 '21 at 05:28
  • @KotaMori This looks promising. However, how can I adapt the existing code to use `concurrent.futures`? –  Jul 12 '21 at 05:40
  • Does this answer your question? [Python selenium multiprocessing](https://stackoverflow.com/questions/53475578/python-selenium-multiprocessing) Be sure to look at my answer, which provides an important modification to the main answer. – Booboo Jul 12 '21 at 11:47
  • @Booboo Your answer helps me understand the process, but I am not sure how to code it for this specific requirement. Can you help me here please? –  Jul 13 '21 at 01:18
  • Unfortunately, I am now away for a few days with no access to a computer except for this phone I am using to write this. But I will look at this when I get back if you can wait. – Booboo Jul 13 '21 at 11:05
  • Yep, will put a pointer on this. It is very exciting for me to get to this point, as it will bring a manifold increase in the efficiency of the process. I am handicapped by my learning curve. Thank you! –  Jul 13 '21 at 12:50
  • Hmm, I was going to post an answer for that question, but it looks like the OP is just seeking help without making any effort himself! This is my previous [answer](https://stackoverflow.com/a/68278490/7658985) for him; the output can be reshaped with some pandas effort to get the exact shape he's looking for. It is a waste of time to answer the same question twice. – αԋɱҽԃ αмєяιcαη Jul 14 '21 at 09:26
  • BTW @furas consider using [arsenic](https://arsenic.readthedocs.io/en/latest/) for such cases; `concurrent.futures` is going to eat RAM & CPU heavily as it runs under `threads`, and async is the best fit for such a scenario. – αԋɱҽԃ αмєяιcαη Jul 14 '21 at 09:28
  • @αԋɱҽԃαмєяιcαη I can understand why it does seem that way; however, as you would know, editing code runs me into multiple errors. Also, your answer is not for concurrent processes, hence this new question. I appreciate your help; however, [your solution](https://stackoverflow.com/questions/68277929/selenium-how-do-i-retry-browser-url-when-valueerrorno-tables-found/68278490#68278490) could not be used, as the required dataframe was of a different schema than your solution. –  Jul 14 '21 at 10:41
  • @PyNoob_N Well, the community is not a code-writing service; you didn't even meet the requirements of [ask], as you didn't show us what you tried, which issue you are facing, or the code you used for `multithreading`. – αԋɱҽԃ αмєяιcαη Jul 14 '21 at 10:44
  • @αԋɱҽԃαмєяιcαη `arsenic` seems interesting. I will have to test it. Thanks. – furas Jul 14 '21 at 13:19
  • @PyNoob_N You asked me to "help me here please", so I posted code to show you; please take a look at it and comment if you have a question. – Booboo Jul 17 '21 at 12:01
  • Yes, I will review the code and update on how close I get to what I need. However, I will accept the answer once I test the code in my environment; I am in the process of resetting my PC. Thank you for this. –  Jul 17 '21 at 13:34

3 Answers


Using DevTools in Chrome/Firefox (Network tab, filters: JS, XHR) I found the URLs the page uses to get data from the server via AJAX.

https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/xbNfvuAM/X0/1/0/1/
https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/l8FEjeUE/X0/1/0/1/

etc.

The URLs are similar; the only difference is the code xbNfvuAM, l8FEjeUE, which I found in the page source as PageTournament({"id":"l8FEjeUE", ..., so I can generate these URLs.

This way I could write code which gets the HTML using only requests, without Selenium.

The original code needed ~20s; with requests it needs only ~6s.

BTW: I also reduced the code in parse_data and use only a DataFrame, without the class GameData.

import requests
import json
import pandas as pd
from bs4 import BeautifulSoup as bs
import time
from multiprocessing import Pool

# --- functions ---

def get_html(url):
    r = requests.get(url, headers=headers)
    text = r.text
    start = text.find('PageTournament({"id":"') + len('PageTournament({"id":"')
    end = text.find('"', start)
    code = text[start:end]
    print(f'code: {code}')

    url = f'https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/{code}/X0/1/0/1/'

    r = requests.get(url, headers=headers)
    text = r.text

    # remove `globals.jsonpCallback('...',` at the start
    text = text.split(',', 1)[1]
    text = text[:-2]              # remove `);` at the end

    # print('json:', text[:25], '...', text[-25:])  # may be interleaved because other processes may print their own text
    print(f'json: {text[:25]} ... {text[-25:]}')  # display it all in one piece

    data = json.loads(text)
    html = data['d']['html']

    # print('html:', html[:25], '...', html[-25:])  # may be interleaved because other processes may print their own text
    print(f'html: {html[:25]} ... {html[-25:]}')

    return html


def parse_data(html):
    try:
        df = pd.read_html(html)[0]
    except KeyError:
        print('KeyError')
        return

    soup = bs(html, "lxml")
    header = soup.select('table th.first2.tl a')

    if not header:
        return

    df['country'] = header[1].text
    df['league'] = header[2].text

    return df


def process(url):
    return parse_data(get_html(url))

# --- main ---

# needed headers - on some systems they have to be defined outside `__main__`

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
    'referer': 'https://www.oddsportal.com/',
}

if __name__ == '__main__':

    # urls for AJAX requests
    # ajax_urls = {
    #    # for 'view-source:https://www.oddsportal.com/soccer/romania/superliga-women/results/#/'
    #    'https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/xbNfvuAM/X0/1/0/1/',
    #    # for 'https://www.oddsportal.com/soccer/world/international-champions-cup/results/#/'
    #    'https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/l8FEjeUE/X0/1/0/1/',
    # }
    # you can find `l8FEjeUE` in the original page as `PageTournament({"id":"l8FEjeUE", ...`

    urls = {
        "https://www.oddsportal.com/soccer/world/international-champions-cup/results/#/",
        "https://www.oddsportal.com/soccer/romania/superliga-women/results/#/",
        "https://www.oddsportal.com/soccer/portugal/league-cup/results/#/",
        "https://www.oddsportal.com/soccer/world/valentin-granatkin-memorial/results/#/",
        "https://www.oddsportal.com/soccer/slovenia/prva-liga/results/#/",
        "https://www.oddsportal.com/soccer/brazil/campeonato-pernambucano/results/#/",
        "https://www.oddsportal.com/soccer/netherlands/eredivisie-cup-women/results/#/",
        "https://www.oddsportal.com/soccer/singapore/premier-league/results/#/",
        "https://www.oddsportal.com/soccer/world/world-cup-women-u20/results/#/",
        "https://www.oddsportal.com/soccer/world/premier-league-asia-trophy/results/#/",
    }

    time_start = time.time()

    # empty `DataFrame` so I don't have to check `if results is None`
    results = pd.DataFrame()

    with Pool(10) as p:
        all_game_data = p.map(process, urls)

    for game_data in all_game_data:

        if game_data is None:
            #print('game_data', game_data)
            continue

        results = results.append(game_data, ignore_index=True)

    time_end = time.time()
    time_diff = (time_end - time_start)

    print(f'time: {time_diff:.2f} s')

    print('--- results ---')
    print(results)
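
Note: `DataFrame.append` was removed in pandas 2.0, so on current pandas versions the accumulation loop above can be replaced by a single `pd.concat` over the frames returned by the pool (same result, assuming `all_game_data` as in the code above):

results = pd.concat(
    [game_data for game_data in all_game_data if game_data is not None],
    ignore_index=True,
)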

EDIT:

As @αԋɱҽԃαмєяιcαη figured out, headers has to be defined outside __main__ because on some systems it would otherwise raise NameError: name 'headers' is not defined.


Doc: multiprocessing
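
A minimal sketch (my illustration, not part of the original answer) of why the start method matters: with the spawn start method (the default on Windows and macOS), each worker re-imports the module, so only module-level names like headers are recreated in the children; names defined only under if __name__ == '__main__': are not.

import multiprocessing as mp

headers = {'user-agent': 'Mozilla/5.0'}  # module level: recreated in spawned workers

def worker(i):
    # under 'spawn' the child re-imports this module, so `headers` exists here;
    # a name defined only inside `__main__` below would raise NameError instead
    return f'{i}: {headers["user-agent"]}'

if __name__ == '__main__':
    mp.set_start_method('spawn')  # force the Windows-like behaviour even on Linux
    with mp.Pool(2) as p:
        print(p.map(worker, range(2)))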


EDIT:

I created code which uses multiprocessing to run the original code.

The problem is that a browser object can't be sent to the worker processes, so every process has to start its own Selenium instance, and this displays 5 browsers at the same time. Starting all the browsers also takes more time, so it took me ~40s.

If each process instead got URLs from a queue and sent back the HTML, it could reuse one browser (or a few browsers running at the same time), but that would need more complex code - a sketch of this idea follows the code below.

import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import time
from multiprocessing import Pool

# --- classes ---

class GameData:

    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []

# --- functions ---

def parse_data(url):
    browser = webdriver.Chrome()
    
    while True:
        try:
            browser.get(url)
            df = pd.read_html(browser.page_source)[0]
            break
        except KeyError:
            print('KeyError:', url)
            continue
            
    html = browser.page_source
    browser.quit()            

    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main'}, {'id': 'tournamentTable'})
    main = content.find('th', {'class': 'first2 tl'})
    if main is None:
        return None
    count = main.findAll('a')
    country = count[1].text
    league = count[2].text
    game_data = GameData()
    game_date = None
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            game_date = row[1].split('-')[0]
            continue
        game_data.date.append(game_date)
        game_data.time.append(row[1])
        game_data.game.append(row[2])
        game_data.score.append(row[3])
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])
        game_data.country.append(country)
        game_data.league.append(league)
    return game_data

# --- main ---

if __name__ == '__main__':

    # URLs go here
    urls = {
        "https://www.oddsportal.com/soccer/world/international-champions-cup/results/#/",
        "https://www.oddsportal.com/soccer/romania/superliga-women/results/#/",
        "https://www.oddsportal.com/soccer/portugal/league-cup/results/#/",
        "https://www.oddsportal.com/soccer/world/valentin-granatkin-memorial/results/#/",
        "https://www.oddsportal.com/soccer/slovenia/prva-liga/results/#/",
        "https://www.oddsportal.com/soccer/brazil/campeonato-pernambucano/results/#/",
        "https://www.oddsportal.com/soccer/netherlands/eredivisie-cup-women/results/#/",
        "https://www.oddsportal.com/soccer/singapore/premier-league/results/#/",
        "https://www.oddsportal.com/soccer/world/world-cup-women-u20/results/#/",
        "https://www.oddsportal.com/soccer/world/premier-league-asia-trophy/results/#/",
    }


    time_start = time.time()
    
    results = None
    
    with Pool(5) as p:
        all_game_data = p.map(parse_data, urls)
        
    for game_data in all_game_data:
            
        if game_data is None:
            #print('game_data', game_data)
            continue
        
        result = pd.DataFrame(game_data.__dict__)
        
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)

    time_end = time.time()
    time_diff = (time_end - time_start)
    print(f'time: {time_diff:.2f} s')
    
    print('--- results ---')
    print(results)    
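
A rough sketch of the queue idea mentioned above (my own illustration, untested against this site; the URLs are placeholders): each worker process owns one browser and keeps fetching URLs until it receives a sentinel, so n browsers can handle any number of URLs.

from multiprocessing import Process, Queue
from selenium import webdriver

def worker(url_queue, html_queue):
    # each worker owns exactly one browser and reuses it for many URLs
    browser = webdriver.Chrome()
    while True:
        url = url_queue.get()
        if url is None:                # sentinel: no more work
            break
        browser.get(url)
        html_queue.put((url, browser.page_source))
    browser.quit()

if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs
    n = 2                              # number of browsers to run in parallel
    url_queue, html_queue = Queue(), Queue()
    workers = [Process(target=worker, args=(url_queue, html_queue)) for _ in range(n)]
    for w in workers:
        w.start()
    for url in urls:
        url_queue.put(url)
    for _ in workers:
        url_queue.put(None)            # one sentinel per worker
    pages = [html_queue.get() for _ in urls]  # collect before join to avoid blocking
    for w in workers:
        w.join()
    print([u for u, _ in pages])
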
furas
  • @αԋɱҽԃαмєяιcαη and @furas, this is a great way to execute! I tested the code, and I have a few observations: every `with Pool` initiates a browser instance for _every url_ and then quits. This allows multiple URLs to be loaded _concurrently_; however, the page is loaded for one URL and then the browser quits. This seems inefficient, as a page has to load in a fresh browser every execution cycle before `browser.quit()`. Wouldn't the better (more efficient) way be to divide the URLs among `with Pool(n)`, so that every browser instance iterates over several of these URLs before quitting? Of course, I am just thinking out loud. –  Jul 15 '21 at 00:21
  • @furas headers need to be out of `__main__` in order to be global; otherwise it will raise `NameError: name 'headers' is not defined` – αԋɱҽԃ αмєяιcαη Jul 15 '21 at 00:30
  • @αԋɱҽԃαмєяιcαη what system do you use? It works correctly on Linux Mint but maybe other systems need it outside `__main__`. – furas Jul 15 '21 at 00:37
  • @αԋɱҽԃαмєяιcαη Yep, that was exactly what I had; however, the edited one resolves this issue –  Jul 15 '21 at 00:40
  • @furas Sounds strange to me; I am using Windows 10 currently. Could you please run this code for me https://bpa.st/T57Q and see if it raises NameError or not – αԋɱҽԃ αмєяιcαη Jul 15 '21 at 00:49
  • @αԋɱҽԃαмєяιcαη - I saw you already edited my answer - so I added a description to explain it. Different systems may run processes in different ways, and they may need different elements outside `__main__` and different ones inside `__main__`. – furas Jul 15 '21 at 00:50
  • @αԋɱҽԃαмєяιcαη your example code https://bpa.st/T57Q works correctly on Linux. What is strange to me is that it also works correctly when I put everything outside `__main__`; I expected that new processes would run `Pool` again, so it would run forever. Long ago I had this problem on Linux - but maybe something changed. – furas Jul 15 '21 at 00:56
  • @furas Thank you. Just figured it out: [On Unix using the fork start method, a child process can make use of a shared resource created in a parent process using a global resource.](https://docs.python.org/3/library/multiprocessing.html#all-start-methods) – αԋɱҽԃ αмєяιcαη Jul 15 '21 at 01:25
  • on Windows, Python basically does `import __main__`, while on unix it inherits the existing process with fork. I mean, as the note says, it's not a good idea to do it on unix even though you can – αԋɱҽԃ αмєяιcαη Jul 15 '21 at 01:29

Here is code that uses a multithreading pool limited to a number of browsers given by MAX_BROWSERS; once a driver has been started in a thread, it is reused by tasks subsequently submitted to that thread.

Note that I have eliminated the while True: loop at the beginning of function parse_data because, frankly, I could not understand its function. Naturally, you can restore it if you feel it is required (a bounded sketch follows below). Whatever you do, however, you do not want to call browser.quit().
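
If you do want a retry, a bounded version that keeps the shared driver alive might look like this (my sketch, not part of the original logic; the helper name is made up, and note that pd.read_html raises ValueError when it finds no tables):

import pandas as pd

def read_table_with_retry(browser, url, retries=3):
    # retry a few times, but never quit the shared, thread-local driver
    for attempt in range(1, retries + 1):
        browser.get(url)
        try:
            return pd.read_html(browser.page_source)[0]
        except (KeyError, ValueError):  # table missing or page not ready
            print(f'attempt {attempt} failed for {url}')
    return None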

In the following example, I have set MAX_BROWSERS = 3:

import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import threading
from multiprocessing.pool import ThreadPool

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # Un-comment the next line to suppress logging:
        #options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        #print('The driver has been "quitted".')

threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver

class GameData:

    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []

def parse_data(url):
    try:
        browser = create_driver()
        browser.get(url)
        df = pd.read_html(browser.page_source)[0]
    except KeyError:
        print('KeyError')
        return None
    html = browser.page_source
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main'}, {'id': 'tournamentTable'})
    main = content.find('th', {'class': 'first2 tl'})
    if main is None:
        return None
    count = main.findAll('a')
    country = count[1].text
    league = count[2].text
    game_data = GameData()
    game_date = None
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            game_date = row[1].split('-')[0]
            continue
        game_data.date.append(game_date)
        game_data.time.append(row[1])
        game_data.game.append(row[2])
        game_data.score.append(row[3])
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])
        game_data.country.append(country)
        game_data.league.append(league)
    return game_data

# URLs go here
urls = {
    "https://www.oddsportal.com/soccer/world/international-champions-cup/results/#/",
    "https://www.oddsportal.com/soccer/romania/superliga-women/results/#/",
    "https://www.oddsportal.com/soccer/portugal/league-cup/results/#/",
    "https://www.oddsportal.com/soccer/world/valentin-granatkin-memorial/results/#/",
    "https://www.oddsportal.com/soccer/slovenia/prva-liga/results/#/",
    "https://www.oddsportal.com/soccer/brazil/campeonato-pernambucano/results/#/",
    "https://www.oddsportal.com/soccer/netherlands/eredivisie-cup-women/results/#/",
    "https://www.oddsportal.com/soccer/singapore/premier-league/results/#/",
    "https://www.oddsportal.com/soccer/world/world-cup-women-u20/results/#/",
    "https://www.oddsportal.com/soccer/world/premier-league-asia-trophy/results/#/",
}

if __name__ == '__main__':
    results = None
    # To limit the number of browsers we will use
    # (set to a large number if you don't want a limit):
    MAX_BROWSERS = 3
    pool = ThreadPool(min(MAX_BROWSERS, len(urls)))
    for game_data in pool.imap(parse_data, urls):
        if game_data is None:
            continue
        result = pd.DataFrame(game_data.__dict__)
        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)

    print(results)
    # ensure all the drivers are "quitted":
    del threadLocal
    import gc
    gc.collect() # a little extra insurance

Prints:


DevTools listening on ws://127.0.0.1:61928/devtools/browser/1874311c-a84e-4903-a5da-c64d93dd86cb

DevTools listening on ws://127.0.0.1:61929/devtools/browser/078e7a54-3a0d-43d5-a05e-04feae6242bc

DevTools listening on ws://127.0.0.1:61930/devtools/browser/241ba2b3-a1ab-4a41-8dec-82a051bdc4bc
0           None  16:06       Atl. Madrid - Juventus       2:1       160      +235       166         World  International Champions Cup
1    04 Aug 2019  14:06            Tottenham - Inter  1:2 pen.      -145      +295       363         World  International Champions Cup
2    03 Aug 2019  16:36    Manchester Utd - AC Milan  3:2 pen.      -128      +279       332         World  International Champions Cup
3    28 Jul 2019  19:06           AC Milan - Benfica       0:1       190      +252       131         World  International Champions Cup
4    27 Jul 2019  00:06    Real Madrid - Atl. Madrid       3:7       106      +259       233         World  International Champions Cup
..           ...    ...                          ...       ...       ...       ...       ...           ...                          ...
245  29 Jan 2021  17:30    Den Haag W - VV Alkmaar W       3:1      -312      +424       550   Netherlands         Eredivisie Cup Women
246  04 Dec 2020  17:30  Heerenveen W - VV Alkmaar W       3:1      -244      +373       450   Netherlands         Eredivisie Cup Women
247  04 Dec 2020  17:30    PEC Zwolle W - Den Haag W       3:0       173      +269       119   Netherlands         Eredivisie Cup Women
248  04 Dec 2020  17:30               PSV W - Ajax W       2:2       110      +256       193   Netherlands         Eredivisie Cup Women
249  04 Dec 2020  17:30       Twente W - Excelsior W       9:2     -1667      +867      1728   Netherlands         Eredivisie Cup Women

[250 rows x 9 columns]
Booboo

I would take the following approach:

  1. Spawn a new thread (info here: Spawning a thread in python)
  2. Create a new instance of the browser with browser = webdriver.Chrome() in each thread
  3. Proceed as normal (a minimal sketch follows below)
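
A minimal sketch of steps 1-3 (my illustration, not the answerer's code; one thread and one browser per URL, with placeholder URLs):

import threading
from selenium import webdriver

def scrape(url, results, lock):
    browser = webdriver.Chrome()         # step 2: a new browser in each thread
    try:
        browser.get(url)                 # step 3: proceed as normal (parse here)
        with lock:
            results[url] = browser.page_source
    finally:
        browser.quit()

if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b']  # placeholders
    results, lock = {}, threading.Lock()
    threads = [threading.Thread(target=scrape, args=(u, results, lock))  # step 1
               for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(list(results))
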
Ezra
  • how can I adapt my code to do the same? I am new to development, hence this newbie question. – PyNoob Jul 14 '21 at 05:58