
I'm not sure what the problem is, but I have a small script that uses Selenium and BeautifulSoup 4 to visit www.oddsportal.com and parse its contents.

The code below does not loop over the league values correctly.

The index in `game_data.league.append(count[1].text)` is fixed at `[1]`, so the same league value is repeated for the whole page instead of changing for every row.
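A minimal sketch of why this happens (using a simplified, hypothetical HTML fragment standing in for the real page): `count` is built once per page from a single header cell, so indexing into it inside the row loop always yields the same anchor.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for one league header cell on the page.
html = """
<th class="first2 tl">
  <a href="/finland/">Finland</a>
  <a href="/finland/veikkausliiga/">Veikkausliiga</a>
</th>
"""
soup = BeautifulSoup(html, "html.parser")
count = soup.select_one("th.first2").find_all("a")

# count is fixed for the whole page, so count[1].text never changes
# no matter how many rows the loop visits.
print(count[0].text)  # Finland
print(count[1].text)  # Veikkausliiga
```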

My code:

import pandas as pd
from selenium import webdriver
from datetime import datetime
from bs4 import BeautifulSoup as bs
from math import nan


browser = webdriver.Chrome()


class GameData:
    def __init__(self):
        self.score = []
        self.date = []
        self.time = []
        self.country = []
        self.league = []
        self.game = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []

    def append(self, score):
        pass


def get_urls(browser, landing_page):
    browser.get(landing_page)
    urls = [i.get_attribute('href') for i in
            browser.find_elements_by_css_selector(
                '.next-games-date > a:nth-child(1), .next-games-date > a:nth-child(n+3)')]

    return urls


def parse_data(html):
    df = pd.read_html(html, header=0)[0]
    html = browser.page_source
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main'}, {'id': 'table-matches'})
    main = content.find('th', {'class': 'first2 tl'})

    if main is None:
        return None

    count = main.findAll('a')
    country = count[0].text
    game_data = GameData()
    game_date = datetime.strptime(soup.select_one('.bold')['href'].split('/')[-2], '%Y%m%d').date()

    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            country = row[1].split('»')[0]
            continue
        game_time = row[1]
        score = row[3] if row[3] else nan

        game_data.date.append(game_date)
        game_data.time.append(game_time)
        game_data.country.append(country)
        game_data.league.append(count[1].text)
        game_data.game.append(row[2])
        game_data.score.append(score)
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])

    return game_data


if __name__ == '__main__':

    start_url = "https://www.oddsportal.com/matches/soccer/"
    urls = []
    browser = webdriver.Chrome()
    results = None
    urls = get_urls(browser, start_url)
    urls.insert(0, start_url)

    for number, url in enumerate(urls):
        if number > 0:
            browser.get(url)
        html = browser.page_source
        game_data = parse_data(html)

        if game_data is None:
            continue

        result = pd.DataFrame(game_data.__dict__)

        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)

`results` (last rows shown):

+-----+-------------------------+------------+--------+-----------+---------------+-------------------------+-------------+-------------+-------------+
|     | score                   | date       | time   | country   | league        | game                    |   home_odds |   draw_odds |   away_odds |
+=====+=========================+============+========+===========+===============+=========================+=============+=============+=============+
| 496 | Inter Turku - Mariehamn | 2021-06-10 | 15:00  | Finland   | Veikkausliiga | Inter Turku - Mariehamn |        1.4  |        4.6  |        7.49 |
+-----+-------------------------+------------+--------+-----------+---------------+-------------------------+-------------+-------------+-------------+
| 497 | KTP - HIFK              | 2021-06-10 | 15:30  | Finland   | Veikkausliiga | KTP - HIFK              |        3.42 |        3.17 |        2.18 |
+-----+-------------------------+------------+--------+-----------+---------------+-------------------------+-------------+-------------+-------------+
| 498 | Haka - HJK              | 2021-06-10 | 15:30  | Finland   | Veikkausliiga | Haka - HJK              |        6.56 |        4.25 |        1.47 |
+-----+-------------------------+------------+--------+-----------+---------------+-------------------------+-------------+-------------+-------------+
| 499 | SJK - KuPS              | 2021-06-10 | 15:30  | Finland   | Veikkausliiga | SJK - KuPS              |        3.34 |        3.25 |        2.18 |
+-----+-------------------------+------------+--------+-----------+---------------+-------------------------+-------------+-------------+-------------+
| 500 | Lahti - Ilves           | 2021-06-10 | 15:30  | Finland   | Veikkausliiga | Lahti - Ilves           |        2.5  |        3.08 |        2.93 |
+-----+-------------------------+------------+--------+-----------+---------------+-------------------------+-------------+-------------+-------------+

How do I pick up the correct league value for every row instead of the same value for the entire page?

PyNoob

1 Answer


To answer your specific problem (without addressing the other issues I see), you need to alter the logic that determines when to update the league:

if n == 0 or (isinstance(row[1], str) and '»' in row[1]):
    league = leagues[n]
    n += 1

I would also retrieve the leagues as their own list:

leagues = [i.text for i in soup.select('.first2 > a:last-child')]
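To illustrate that selector on a small, hypothetical fragment mimicking the table layout (one `th.first2` header per league section), `:last-child` grabs the final anchor of each header, i.e. the league rather than the country:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in: two league sections, each with a header cell
# whose anchors are <country>, <league>.
html = """
<table>
  <tr><th class="first2 tl"><a>Finland</a><a>Veikkausliiga</a></th></tr>
  <tr><th class="first2 tl"><a>England</a><a>Premier League</a></th></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# One entry per header, in document order.
leagues = [i.text for i in soup.select('.first2 > a:last-child')]
print(leagues)  # ['Veikkausliiga', 'Premier League']
```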

import pandas as pd
from selenium import webdriver
from datetime import datetime
from bs4 import BeautifulSoup as bs
from math import nan


browser = webdriver.Chrome()


class GameData:
    def __init__(self):
        self.score = []
        self.date = []
        self.time = []
        self.country = []
        self.league = []
        self.game = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []

    def append(self, score):
        pass


def get_urls(browser, landing_page):
    browser.get(landing_page)
    urls = [i.get_attribute('href') for i in
            browser.find_elements_by_css_selector(
                '.next-games-date > a:nth-child(1), .next-games-date > a:nth-child(n+3)')]

    return urls


def parse_data(html):
    df = pd.read_html(html, header=0)[0]
    html = browser.page_source
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main'}, {'id': 'table-matches'})
    main = content.find('th', {'class': 'first2 tl'})

    if main is None:
        return None

    count = main.findAll('a')
    country = count[0].text
    game_data = GameData()
    game_date = datetime.strptime(soup.select_one('.bold')['href'].split('/')[-2], '%Y%m%d').date()
    leagues = [i.text for i in soup.select('.first2 > a:last-child')]

    n = 0
    
    for row in df.itertuples():
        if n == 0 or (isinstance(row[1], str) and '»' in row[1]):
            league = leagues[n]
            n += 1
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            country = row[1].split('»')[0]
            continue
        game_time = row[1]
        score = row[3] if row[3] else nan

        game_data.date.append(game_date)
        game_data.time.append(game_time)
        game_data.country.append(country)
        game_data.league.append(league)
        game_data.game.append(row[2])
        game_data.score.append(score)
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])

    return game_data


if __name__ == '__main__':

    start_url = "https://www.oddsportal.com/matches/soccer/"
    urls = []
    browser = webdriver.Chrome()
    results = None
    urls = get_urls(browser, start_url)
    urls.insert(0, start_url)

    for number, url in enumerate(urls):
        if number > 0:
            browser.get(url)
        html = browser.page_source
        game_data = parse_data(html)

        if game_data is None:
            continue

        result = pd.DataFrame(game_data.__dict__)

        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)
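One forward-looking note: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer installs the accumulation step above will break. The idiomatic replacement is to collect the per-page frames in a list and concatenate once (the frames below are stand-ins, not real scraped data):

```python
import pandas as pd

# Stand-in per-page results; in the real script each frame would come
# from pd.DataFrame(game_data.__dict__).
frames = [
    pd.DataFrame({"game": ["KTP - HIFK"], "home_odds": [3.42]}),
    pd.DataFrame({"game": ["Haka - HJK"], "home_odds": [6.56]}),
]

# Single concat at the end replaces repeated results.append(...) calls.
results = pd.concat(frames, ignore_index=True)
print(len(results))  # 2
```

Similarly, Selenium 4 removed `find_elements_by_css_selector`; the replacement is `browser.find_elements(By.CSS_SELECTOR, ...)` with `from selenium.webdriver.common.by import By`.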
QHarr
  • Hi, Can you please lead me to where I can learn web scraping better? I just seem to be running into the same problems and I have to ask around for questions. Yes, this code has issues which I am aware of e.g. the `score` loops correctly for where it is available and then it gets `game' value. – PyNoob Jun 08 '21 at 05:04
  • Hiya, I think maybe it helps to simplify the issue. If you have a problem with league value then just focus on the logic that determine league value. Copy the code into another file and remove code that isn't part of the current problem. I printed `row` each time in the loop as logic is applied to this, examined the tests, and also looked at the contents of `count`. It was a logic flaw rather than lack of web-scraping knowledge I think. – QHarr Jun 08 '21 at 05:07
  • In terms of learning better web-scraping I would go through similar questions on SO and challenge yourself to answer them without looking at the existing answers, then compare against the existing. – QHarr Jun 08 '21 at 05:09
  • Yep. That is something I will do. Thank you. – PyNoob Jun 08 '21 at 05:10
  • BTW, there is nothing wrong with asking. So long as the problem and research is clear. :-) – QHarr Jun 08 '21 at 05:11
  • You can always drop by the [dawghaus](https://chat.stackoverflow.com/rooms/169987/dawgs-waffle-haus-) and ping me to discuss occasional problems. I am not always there now-a-days but if I am, and it is not too often, I am happy to help discuss things. – QHarr Jun 08 '21 at 05:13
  • can I please ask you to help [answer this question](https://stackoverflow.com/questions/75058259/scraping-oddsportal-for-matches-and-odds). The website [Oddsportal](www.oddsportal.com) changed yesterday hence the scraper broke. Can you please help me fix it? Apologies to grab your attention this way – PyNoob Jan 11 '23 at 12:51