
I am trying to iterate between XPaths on www.oddsportal.com.

I tested the code below and `element.click()` works:

from selenium import webdriver
browser = webdriver.Chrome()
browser.get("https://www.oddsportal.com/matches/soccer/")
element = browser.find_element_by_xpath("/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[4]/div/div/span/a[3]")
element.click()

The XPaths I want the URLs to iterate between are:

/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/div/div/span/a[2]
/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/div/div/span/a[3]
/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/div/div/span/a[4]
/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/div/div/span/a[5]
/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/div/div/span/a[6]
/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/div/div/span/a[7]
/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]/div[2]/div[1]/div[3]/div/div/span/a[8]
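Since these XPaths differ only in the final `a[n]` index, one way to loop over them is to build each path from a template rather than hard-coding all seven. This is only a sketch of the string-building part; the commented-out lines show where the clicks would go in the live browser session from the snippet above.

```python
# Build the full XPaths for a[2] .. a[8] from a single template.
BASE_XPATH = ("/html/body/div[1]/div/div[2]/div[6]/div[1]/div/div[1]"
              "/div[2]/div[1]/div[3]/div/div/span/a[{}]")

def day_xpaths(start=2, stop=8):
    """Return the list of full XPaths for the given a[n] index range."""
    return [BASE_XPATH.format(i) for i in range(start, stop + 1)]

# In a live Selenium session you would then iterate and click each one:
# for xp in day_xpaths():
#     browser.find_element_by_xpath(xp).click()
```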

I have code that scrapes any given set of URLs, as below:

import pandas as pd
from selenium import webdriver
from datetime import datetime
from bs4 import BeautifulSoup as bs

browser = webdriver.Chrome()

urls = {
    "https://www.oddsportal.com/matches/soccer/"
}


class GameData:
    def __init__(self):
        self.country = []


def parse_data(url):
    browser.get(url)
    html = browser.page_source
    df = pd.read_html(html, header=0)[0]
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main', 'id': 'table-matches'})
    main = content.find('th', {'class': 'first2 tl'})
    if main is None:
        return None
    count = main.find_all('a')
    country = count[0].text
    game_data = GameData()
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            country = row[1].split('»')[0]
            continue
        game_data.country.append(country)

    return game_data


if __name__ == '__main__':

    results = None

    for url in urls:
        game_data = parse_data(url)
        if game_data is None:
            continue
        result = pd.DataFrame(game_data.__dict__)
        if results is None:
            results = result
        else:
            results = pd.concat([results, result], ignore_index=True)

How can I integrate the XPaths into this code?

I tried the solutions discussed here; however, I am getting nowhere, probably because I am still early in the learning curve.

PyNoob
  • That xpath doesn't match anything for me. What are you actually trying to click and then do? Is it navigating to indiv matches pages? – QHarr May 24 '21 at 05:58
  • xpath was a bit tricky with me. I had to use the full xpath for the `element.click()` to work. You can try xpath i.e. `"//*[@id="col-content"]/div[3]/div/div/span/a[3]"` I am trying to iterate between the "Tomorrow" and further paths. As an example: https://imgur.com/2ImzaCe – PyNoob May 24 '21 at 06:03

1 Answer


You could simply construct the URLs by adding to (or subtracting from) today's date. Alternatively, you can use `nth-child` to extract the relevant nodes: specify the first anchor tag (yesterday), then an `nth-child` range to get from tomorrow onwards, combining the two with the CSS "or" syntax (a comma). You don't need to include today, as that is the landing page. Then you can `browser.get` each extracted link in a loop over the returned list:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.oddsportal.com/matches/soccer/')
other_days = [i.get_attribute('href') 
              for i in browser.find_elements_by_css_selector('.next-games-date > a:nth-child(1), .next-games-date > a:nth-child(n+3)')]
print(other_days)
for a_day in other_days:
    browser.get(a_day)
    #do something
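For the date-based alternative mentioned at the start, a sketch might look like the following. Note the `/YYYYMMDD/` suffix is an assumption about the site's URL scheme, so verify it against an actual link on the page before relying on it.

```python
from datetime import date, timedelta

def date_urls(days_ahead=7, base="https://www.oddsportal.com/matches/soccer/"):
    """Build URLs for the next `days_ahead` days by appending a date suffix.

    Assumes (unverified) that the site accepts base + YYYYMMDD/ for a
    specific day's matches.
    """
    today = date.today()
    return [
        f"{base}{(today + timedelta(days=d)).strftime('%Y%m%d')}/"
        for d in range(1, days_ahead + 1)
    ]
```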

Integrating this with the code you shared in the comments (this meant rewriting some of your existing functions):

import pandas as pd
from selenium import webdriver
from datetime import datetime
from bs4 import BeautifulSoup as bs

class GameData:
    def __init__(self):
        self.country = []


def get_urls(browser, landing_page):
    
    browser.get(landing_page)
    urls = [i.get_attribute('href') for i in 
           browser.find_elements_by_css_selector('.next-games-date > a:nth-child(1), .next-games-date > a:nth-child(n+3)')]
    
    return urls

def parse_data(html):

    df = pd.read_html(html, header=0)[0]
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main', 'id': 'table-matches'})
    main = content.find('th', {'class': 'first2 tl'})
    
    if main is None:
        return None
    
    count = main.find_all('a')
    country = count[0].text
    game_data = GameData()
    
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            country = row[1].split('»')[0]
            continue
        game_data.country.append(country)

    return game_data


if __name__ == '__main__':
  
    start_url = "https://www.oddsportal.com/matches/soccer/"
    browser = webdriver.Chrome()
    results = None
    urls = get_urls(browser, start_url)
    urls.insert(0, start_url)
    
    for number, url in enumerate(urls):
        if number > 0:
            browser.get(url)
        html = browser.page_source
        game_data = parse_data(html)
        
        if game_data is None:
            continue
        
        result = pd.DataFrame(game_data.__dict__)
        
        if results is None:
            results = result
        else:
            results = pd.concat([results, result], ignore_index=True)
QHarr
  • This is great! I can iterate between the different days. Thanks! Now, as you can see in the code, `browser.get` is used inside `def parse_data(url)` and the dataframe is then appended in `for url in urls`, so how can I use your method in my code? – PyNoob May 24 '21 at 20:17
  • Is the returned list supposed to feed into `for url in urls` ? – QHarr May 24 '21 at 20:23
  • Yes. Complying with the principles of SO, I did not include the dataframe, which defines a number of other attributes that then get appended into the dataframe that feeds into `for url in urls`. – PyNoob May 24 '21 at 20:26
  • So `url` would be each day? – QHarr May 24 '21 at 20:30
  • Yes, earlier code, I used to have url for each day as defined in urls = {} as the dict takes multiple url which was cumbersome and hence this question. If you would like further clarity, This is my existing code: https://www.file.io/download/Jfk1JrNMdyBm and I am modifying it to accomodate your method https://file.io/wiAuoXMfjnCy however I am unable to accomodate the iterations yet. – PyNoob May 24 '21 at 20:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/232832/discussion-between-qharr-and-pynoob). – QHarr May 24 '21 at 20:48