Scraping complicated tables with BeautifulSoup

Question

I'm working on a sports betting scraper, but I've run into a complicated table. The snippet below shows what most of the rows look like. My main goal is to extract all the text from it (the participants' names, the date and time, the odds, etc.).

<tr data-qa="pre-event" class="events-list__grid__event"><th scope="row" class="events-list__grid__info"><div class="events-list__grid__info__datetime"><div class="events-list__grid__info__datetime__time">
            20:05
          </div> <div class="events-list__grid__info__datetime__date">
            24/07
          </div></div> <a href="/cote/sara-errani-paula-ormaechea/27034463/" class="GTM-event-link events-list__grid__info__main" data-testid="TENN" title="WTA - Varșovia - Calificări (F)"><div class="events-list__grid__info__main__row"><div class="events-list__grid__info__main__participants"><div class="events-list__grid__info__main__participants__participant"><span class="events-list__grid__info__main__participants__participant-name"><!---->
                  Sara Errani
                  <!----></span> <!----></div><div class="events-list__grid__info__main__participants__participant"><span class="events-list__grid__info__main__participants__participant-name"><!---->
                  Paula Ormaechea
                  <!----></span> <!----></div> <!----></div> <div class="events-list__grid__info__main__actions"><span class="event-icons"><!----> <!----> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1" class="icon--color-cloud-burst-500 icon--clickable kz-icon-xs has-tooltip" data-original-title="null"><path d="M18.545 6H5.455C4.655 6 4 6.668 4 7.5v9c0 .825.655 1.5 1.455 1.5h13.09c.8 0 1.455-.675 1.455-1.5v-9c0-.832-.655-1.5-1.455-1.5zm0 10.5H5.455v-9h13.09v9zM9.818 9v6l5.091-3-5.09-3z"></path></svg> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1" class="icon--color-cloud-burst-500 kz-icon icon--clickable kz-icon-xs has-tooltip" data-original-title="null"><path d="M7.833 19.5H9.5V8.03H7.833V19.5zm3.334 0h1.666v-15h-1.666v15zm-6.667 0h1.667v-7.941H4.5V19.5zm10 0h1.667V8.03H14.5V19.5zm3.333-7.941V19.5H19.5v-7.941h-1.667z"></path></svg> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1" class="icon--color-cloud-burst-500 icon--clickable kz-icon-xs has-tooltip" data-original-title="null"><path d="M14.2 4.534a.532.532 0 00-.344-.504.503.503 0 00-.572.17l-6.07 7.862a.96.96 0 00-.131.996c.147.33.466.542.817.542h1.928c.142 0 .258.12.258.267v5.6c0 .226.138.428.344.503a.503.503 0 00.572-.17l6.07-7.862a.96.96 0 00.13-.996.899.899 0 00-.817-.542h-1.928a.262.262 0 01-.257-.267v-5.6z"></path></svg> <!----> <!----></span> <!----></div></div> <!----></a></th> <td class="table__markets__market"><div><section><div class="table__markets__market__title"><div class="table__markets__market__title__text">
      Câştigător
    </div> <div class="table__markets__market__title__markets"><a href="/cote/sara-errani-paula-ormaechea/27034463/" class="table__markets__market__title__markets__link">
        +4
      </a></div></div> <div class="selections"><button aria-label="Bet on Sara Errani with odds 1.17." data-selnid="2685084631" data-qa="pre-event-selection" class="selections__selection selections__selection--columns-2 GTM-selection-add" mc-data="[object Object]" event-url=""><!----> <!----> <!----> <!----> <span class="selections__selection__odd"><!--fragment#15ac200c85#head-->
    1.17
  <!--fragment#15ac200c85#tail--></span></button><button aria-label="Bet on Paula Ormaechea with odds 4.6." data-selnid="2685084632" data-qa="pre-event-selection" class="selections__selection selections__selection--columns-2 GTM-selection-add" mc-data="[object Object]" event-url=""><!----> <!----> <!----> <!----> <span class="selections__selection__odd"><!--fragment#80111e10a3#head-->
    4.60
  <!--fragment#80111e10a3#tail--></span></button> <!----></div></section></div></td><td class="table__markets__market"></td><td class="table__markets__market"></td> <td class="events-list__grid__total-markets">
        +4
      </td></tr>

In this case, what I need are: '20:05; 24/07; Sara Errani; Paula Ormaechea; +4; 1.17; 4.6' + the link above 'Sara Errani'.

How can I loop through all the tr elements and extract the relevant data?

Show us your code, what have you tried until now? What is the url of the page you are scraping? — Barry the Platipus, Jul 24 '22 at 17:46
I really haven't tried anything more than trying to find the table and printing it which returned nothing. When I print the whole page through bs4, it returns html code where the table elements can't be found for some reason. URL: https://ro.betano.com/sport/tenis/meciurile-urmatoare-de-azi/ — goldie, Jul 24 '22 at 17:48
That's a website only available in your country, it seems. Based on your description those tables are dynamic, loaded by javascript. — Barry the Platipus, Jul 24 '22 at 19:27
Yes, that seems to be the case. Upon reading what requests.get exports, there's only one line of the table showing under the scripts tag. Is there any way to scrape data loaded by js? — goldie, Jul 24 '22 at 19:32
Try to find the apis accessed by javascript (xhr calls in Network tab). Failing that, selenium. — Barry the Platipus, Jul 24 '22 at 19:33
There are 3 xhr reports in the Network tab, but I have no idea what to do with them or what even to look for. Any guidance? — goldie, Jul 24 '22 at 19:36
Inspect them - see what sort of call is being made to each of them - GET or POST? see what payload is being sent, look up the response, is it json, and does it contain the data you need? — Barry the Platipus, Jul 24 '22 at 19:38
2 GET & 1 POST, the GET ones have json code in the response but nothing that seems useful, while the POST one has payload but again, nothing rings any bells. — goldie, Jul 24 '22 at 19:43

d r · Accepted Answer · 2022-07-25T06:26:31.270

With html_doc containing your data from the question:

Analyze the soup and create mappings for the data you want to extract:
- find the classes/ids/names of the tags you want to extract (in this case only classes)
- define the tag and how many of them to extract
- construct your own mappings so that you can iterate over them
Iterate through the mappings:
- use each mapping to pull out the text or attribute
Collect the results.

Regards...

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "date": ["div", "events-list__grid__info__datetime__date", 1],
    "href": ["a", "GTM-event-link events-list__grid__info__main", 1],
    "name": ["span", "events-list__grid__info__main__participants__participant-name", 2],
    "link": ["a", "table__markets__market__title__markets__link", 1],
    "odd": ["span", "selections__selection__odd", 2]
    }
results = {}

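# For each field, find the matching tags once, then keep the first `count` of them.
for k, (tag, css_class, count) in mappings.items():
    elems = soup.find_all(tag, attrs={'class': css_class})
    for i in range(count):
        key = k + '_' + str(i + 1)
        # 'href' is read from the attribute; every other field is the tag's text
        results[key] = elems[i]['href'] if k == 'href' else elems[i].text.strip()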

print(results)
#
#   R e s u l t :
#
#   { 
#     'time_1': '20:05', 
#     'date_1': '24/07', 
#     'href_1': '/cote/sara-errani-paula-ormaechea/27034463/', 
#     'name_1': 'Sara Errani', 
#     'name_2': 'Paula Ormaechea', 
#     'link_1': '+4', 
#     'odd_1': '1.17', 
#     'odd_2': '4.60'
#   }

ADDITION:
With your latest data as html_doc (https://pastebin.com/nx6x00NX), I added iteration over the rows and event numbering.
The pretty() function is by STH (user:56338), from "How to pretty print nested dictionaries?".
If you can get a soup that contains the table, this row iteration will work on it; the rest of the code is the same as before.

from bs4 import BeautifulSoup

def pretty(dct, indent=0):      # function by ---> STH user:56338
    for key, value in dct.items():
        print('\t' * indent + str(key))
        if isinstance(value, dict):
            pretty(value, indent+1)
        else:
            print('\t' * (indent+1) + str(value))
         
soup = BeautifulSoup(html_doc, 'html.parser')

mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "date": ["div", "events-list__grid__info__datetime__date", 1],
    "href": ["a", "GTM-event-link events-list__grid__info__main", 1],
    "name": ["span", "events-list__grid__info__main__participants__participant-name", 2],
    "link": ["a", "table__markets__market__title__markets__link", 1],
    "odd": ["span", "selections__selection__odd", 2]
    }
    
events = {}
# Each <tr> with this class is one event; search inside the row, not the whole soup.
rows = soup.find_all("tr", attrs={'class': "events-list__grid__event"})
for nr, row_soup in enumerate(rows, start=1):
    results = {}
    for k, (tag, css_class, count) in mappings.items():
        elems = row_soup.find_all(tag, attrs={'class': css_class})
        for i in range(count):
            key = k + '_' + str(i + 1)
            # 'href' is read from the attribute; every other field is the tag's text
            results[key] = elems[i]['href'] if k == 'href' else elems[i].text.strip()
    events['event_' + str(nr)] = results
    
pretty(events)
#
'''     R e s u l t
event_1
        time_1
                22:47
        date_1
                24/07
        href_1
                https://ro.betano.com/cote/sophia-yang-tatum-burger/27018714/
        name_1
                Sophia Yang
        name_2
                Tatum Burger
        link_1
                +4
        odd_1
                1.87
        odd_2
                1.87
event_2
        time_1
                23:30
        date_1
                24/07
        href_1
                https://ro.betano.com/cote/cleo-hutchinson-seha-yu/27018746/
        name_1
                Cleo Hutchinson
        name_2
                Seha YU
        link_1
                +4
        odd_1
                1.87
        odd_2
                1.87
event_3
        time_1
                23:30
        date_1
                24/07
        href_1
                https://ro.betano.com/cote/laura-bente-josie-frazier/27018754/
        name_1
                Laura Bente
        name_2
                Josie Frazier
        link_1
                +4
        odd_1
                1.87
        odd_2
                1.87
event_4
        time_1
                00:00
        date_1
                25/07
        href_1
                https://ro.betano.com/cote/kelly-keller-emma-sun/27018749/
        name_1
                Kelly Keller
        name_2
                Emma Sun
        link_1
                +4
        odd_1
                1.45
        odd_2
                2.60
event_5
        time_1
                00:00
        date_1
                25/07
        href_1
                https://ro.betano.com/cote/nadia-kojonroj-tanvi-narendran/27018750/
        name_1
                Nadia Kojonroj
        name_2
                Tanvi Narendran
        link_1
                +4
        odd_1
                1.87
        odd_2
                1.87
'''
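
The mappings approach indexes elems[i] directly, so a row that lacks one of the expected tags (for example an event with no odds posted yet) raises IndexError. A minimal sketch of a more tolerant per-row parser follows. It is not part of the original answer: the SAMPLE markup is a made-up, trimmed-down pair of rows that reuses the class names from the question, and parse_row is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two rows using the class names from the question.
# The second row has no market link and no odds.
SAMPLE = """
<table>
<tr class="events-list__grid__event">
  <th>
    <div class="events-list__grid__info__datetime__time"> 20:05 </div>
    <div class="events-list__grid__info__datetime__date"> 24/07 </div>
    <a class="GTM-event-link events-list__grid__info__main" href="/cote/a-b/1/">
      <span class="events-list__grid__info__main__participants__participant-name"> Player A </span>
      <span class="events-list__grid__info__main__participants__participant-name"> Player B </span>
    </a>
  </th>
  <td>
    <a class="table__markets__market__title__markets__link" href="/cote/a-b/1/"> +4 </a>
    <span class="selections__selection__odd"> 1.17 </span>
    <span class="selections__selection__odd"> 4.60 </span>
  </td>
</tr>
<tr class="events-list__grid__event">
  <th>
    <div class="events-list__grid__info__datetime__time"> 21:00 </div>
    <div class="events-list__grid__info__datetime__date"> 24/07 </div>
    <a class="GTM-event-link events-list__grid__info__main" href="/cote/c-d/2/">
      <span class="events-list__grid__info__main__participants__participant-name"> Player C </span>
      <span class="events-list__grid__info__main__participants__participant-name"> Player D </span>
    </a>
  </th>
  <td></td>
</tr>
</table>
"""


def parse_row(row):
    """Extract one event from a <tr>, using None or [] for anything missing."""

    def texts(selector):
        # All matches inside this row, whitespace stripped
        return [el.get_text(strip=True) for el in row.select(selector)]

    def first(selector):
        found = texts(selector)
        return found[0] if found else None

    link = row.select_one("a.events-list__grid__info__main")
    return {
        "time": first("div.events-list__grid__info__datetime__time"),
        "date": first("div.events-list__grid__info__datetime__date"),
        "href": link.get("href") if link else None,
        "names": texts("span.events-list__grid__info__main__participants__participant-name"),
        "markets": first("a.table__markets__market__title__markets__link"),
        "odds": texts("span.selections__selection__odd"),
    }


soup = BeautifulSoup(SAMPLE, "html.parser")
events = [parse_row(row) for row in soup.select("tr.events-list__grid__event")]
for event in events:
    print(event)
```

Returning lists for names and odds means the code does not need to know in advance how many selections a market has, which is the count the third mapping entry had to hard-code.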

'IndexError: list index out of range' There seems to be a problem. The html code from my original question comes from inspect element into my own browser. When doing requests.get, the html document exported does not contain the table, only a — goldie, Jul 24 '22 at 19:32
Sorry, but access to that site is allowed only from your country, so I cannot test anything from here. If the table is generated by a script in real time, you may not be able to get a soup containing the data you are trying to extract. The answer refers to the soup in the question. — d r, Jul 24 '22 at 19:46
From what I read somewhere else too, it's not possible to load javascript through beautifulsoup so I'll combine it with Selenium to first download the website as a html file, then inspect it. I did that to run your code but it's not iterating, it's only printing the first result, then exiting. How can I make it iterate? — goldie, Jul 24 '22 at 20:02
This works on the data from the question. I would have to see the complete html to make different mappings. I assume there are multiple tables. If the tables have the same structure, you can get them one by one and use the code from the answer for each of them. If they do not, I would need to make new mappings. Anyway, I could not answer it without the soup. — d r, Jul 24 '22 at 20:17
There's only one table, but each tr element has the exactly same classes. The code does iterate through the lines but because the classes are shared, it only gives the first round of data. Here are some of the lines: https://pastebin.com/nx6x00NX , here's a screenshot of inspect element: https://prnt.sc/_LooUTe57Y5X I can send the whole html document too if that's needed — goldie, Jul 24 '22 at 20:28
I will look at it in the morning, but in my mappings the last element of each list says how many elements to extract (1 or 2). You can try increasing the numbers and see what happens. If the number is too big you will get a 'list index out of range' error. It will probably be useful to include tr iterations too. I will check it tomorrow. — d r, Jul 24 '22 at 20:59
@goldie I just posted the addition with solution for your latest soup. Regards... — d r, Jul 25 '22 at 06:11
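
The comments above settle on rendering the page in a real browser first, because requests.get only returns the HTML as it is before JavaScript fills in the table. A rough sketch of that step, under some assumptions: fetch_rendered_html and scrape are hypothetical helper names, Chrome with a matching driver is assumed to be installed, and the "wait" is a crude check for the row class name in the page source rather than anything the site documents.

```python
import time

from bs4 import BeautifulSoup

ROW_CLASS = "events-list__grid__event"  # row class taken from the question


def fetch_rendered_html(driver, url, timeout=15.0, poll=0.5):
    """Open url and return the page source once the event rows show up.

    `driver` is anything with .get(url) and .page_source, such as a
    Selenium WebDriver.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        html = driver.page_source
        if ROW_CLASS in html:  # crude substring check (assumption)
            return html
        time.sleep(poll)
    raise TimeoutError("no '%s' rows appeared within %s s" % (ROW_CLASS, timeout))


def scrape(url):
    """Render the page with Selenium and return the event <tr> tags."""
    # Imported here so fetch_rendered_html can be reused without Selenium.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        html = fetch_rendered_html(driver, url)
    finally:
        driver.quit()
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("tr", class_=ROW_CLASS)
```

Calling scrape("https://ro.betano.com/sport/tenis/meciurile-urmatoare-de-azi/") would then give the rows to feed into the row loop from the answer. Selenium's own WebDriverWait is the more usual way to wait for an element; the polling loop here only keeps the sketch free of extra imports.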
