
I am very new to web scraping, and I am trying different ways to adapt this code, which works for the same kind of tabular scraping on the same website (a different URL, though), but I am getting nowhere.

working code:

from selenium import webdriver
from bs4 import BeautifulSoup as bs
import pandas as pd

browser = webdriver.Chrome()

urls = {
    "https://www.oddsportal.com/soccer/england/premier-league/results/"
}
class GameData:

    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []


def parse_data(url):
    browser.get(url)
    df = pd.read_html(browser.page_source, header=0)[0]
    html = browser.page_source
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main', 'id': 'tournamentTable'})  # one attrs dict; a second positional dict would be passed as `recursive`, not attrs
    main = content.find('th', {'class': 'first2 tl'})
    if main is None:
        return None
    count = main.findAll('a')
    country = count[1].text
    league = count[2].text
    game_data = GameData()
    game_date = None
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            game_date = row[1].split('-')[0]
            continue
        game_data.date.append(game_date)
        game_data.time.append(row[1])
        game_data.game.append(row[2])
        game_data.score.append(row[3])
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])
        game_data.country.append(country)
        game_data.league.append(league)
    return game_data




if __name__ == '__main__':

    results = None

    for url in urls:
        try:
            game_data = parse_data(url)
            if game_data is None:
                continue
            result = pd.DataFrame(game_data.__dict__)
            if results is None:
                results = result
            else:
                results = results.append(result, ignore_index=True)
        except ValueError:  # retry the parse once if the first attempt fails
            game_data = parse_data(url)
            if game_data is None:
                continue
            result = pd.DataFrame(game_data.__dict__)
            if results is None:
                results = result
            else:
                results = results.append(result, ignore_index=True)
        except AttributeError:
            game_data = parse_data(url)
            if game_data is None:
                continue
            result = pd.DataFrame(game_data.__dict__)
            if results is None:
                results = result
            else:
                results = results.append(result, ignore_index=True)
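
(As an aside, DataFrame.append has since been deprecated and removed in newer pandas releases; collecting the frames in a list and concatenating once at the end is the equivalent:)

frames = []
for url in urls:
    game_data = parse_data(url)
    if game_data is None:
        continue
    frames.append(pd.DataFrame(game_data.__dict__))
# a single concat replaces the repeated append calls
results = pd.concat(frames, ignore_index=True) if frames else None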

df:

|    | date              | time   | game                             | score   |   home_odds |   draw_odds |   away_odds | country   | league         |
|----|-------------------|--------|----------------------------------|---------|-------------|-------------|-------------|-----------|----------------|
|  0 | Yesterday, 11 May | 19:15  | Southampton - Crystal Palace     | 3:1     |        1.89 |        3.8  |        4.11 | England   | Premier League |
|  1 | Yesterday, 11 May | 17:00  | Manchester Utd - Leicester       | 1:2     |        3.72 |        3.58 |        2.07 | England   | Premier League |
|  2 | 10 May 2021       | 19:00  | Fulham - Burnley                 | 0:2     |        2.24 |        3.44 |        3.38 | England   | Premier League |
|  3 | 09 May 2021       | 18:00  | Arsenal - West Brom              | 3:1     |        1.5  |        4.53 |        6.76 | England   | Premier League |
|  4 | 09 May 2021       | 15:30  | West Ham - Everton               | 0:1     |        2.15 |        3.56 |        3.48 | England   | Premier League |

Here's what I have found as far as differences go:

XPath of working code URL:

urls = {
    "https://www.oddsportal.com/soccer/england/premier-league/results/"
}

//*[@id="tournamentTable"]/tbody/tr[4]/td[2]/a

XPath of desired URL:

urls = {
    "https://www.oddsportal.com/matches/soccer/20210515/"
}

//*[@id="table-matches"]/table/tbody/tr[2]/td[2]/a[2]

When I run

if main is None:
    return None
count = main.findAll('a')
print(len(count))

I get

2

I had asked this question before and tried

content = cont.find('div', {'id': 'col-content'})
content = content.find('table', {'class': 'table-main'}, {'id': 'table-matches'})
main = content.find('th', {'class': 'first2 tl'})

However, I am more confused than before.

Any guidance will be much appreciated.

2 Answers


Your variable count has a len of 2. Python indexes start at 0, which means count[2] will raise an IndexError (there are only 2 elements in the list).

Please change

country = count[1].text
league = count[2].text

to

country = count[0].text
league = count[1].text
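
Since the original tournament page evidently has three links in that header (count[1] and count[2] worked there) while this page has only two, you could also index from the end, assuming the last two links are always country and league — that way one parse_data handles both layouts:

links = main.findAll('a')
if len(links) < 2:
    return None  # unexpected header layout
country = links[-2].text  # second-to-last link: country
league = links[-1].text   # last link: league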


Since you're using selenium anyway and the site has jQuery:

data = driver.execute_script('''
  let [country, league] = $('.bflp + a').get().map(a => a.innerText.trim())
  return $('tr.deactivate').get().map(tr => {
    let tds = $(tr).find('td').get().map(td => td.innerText.trim())
    return {
      date: $(tr).prevAll('tr.center').find('th').first().text().trim(),
      time: tds[0],
      game: tds[1],
      score: tds[2],
      home_odds: tds[3],
      draw_odds: tds[4],
      away_odds: tds[5],
      country,
      league
    }
  })
''')

If you don't like that you can just use the bit that gets country and league.
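
If you do use the full script, the result comes back as a list of plain dicts, so it converts straight into the same DataFrame shape you had:

df = pd.DataFrame(data)
print(df.head())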

  • Where would I add that bit of code and/or substitute it in mine? – PyNoob May 14 '21 at 05:38
  • Add it anywhere after driver.get. Actually, you are calling it browser.get; I would change that to be more consistent with other selenium code. – pguardiario May 14 '21 at 09:42
  • Apologies for grabbing your attention here, but could you please help me [answer this question](https://stackoverflow.com/questions/75058259/scraping-oddsportal-for-matches-and-odds)? The website changed yesterday, I would need to rewrite the scraper, and I am requesting your help in this regard. – PyNoob Jan 11 '23 at 13:41