I'm trying to read the website https://koniewyscigowe.pl/wyscig?w=14222-tor-partynice-nagroda-cheval-francais using a Python script.

My code:

import requests
from bs4 import BeautifulSoup

web_content = requests.get('https://koniewyscigowe.pl/wyscig?w=14222-tor-partynice-nagroda-cheval-francais')
soup = BeautifulSoup(web_content.text)
for index, table in enumerate(soup.find_all('div', {'class': 'table-responsive'})):
    if index == 0:
        pass
    elif index == 1:
        # table.tbody is None here, so this line raises the AttributeError
        for starts_stats in table.tbody.find_all('tr'):
            print('HERE WE ARE')

When I run this code, I get an error:

AttributeError: 'NoneType' object has no attribute 'find_all'

table.tbody is None. I can't find any tbody section in the second div class="table-responsive".

When I print table just before the error occurs, I see:

<div class="table-responsive">
<table class="table table-striped table-bordered">
<thead>
<tr>
<th style="text-align:center"></th>
<th style="text-align:center">rekord</th>
<th style="text-align:center">koń</th>
<th style="text-align:center">wiek</th>
<th style="text-align:center">powozacy</th>
<th style="text-align:center">trener</th>
<th style="text-align:center">wygrana</th>
</tr>
</thead>
<tr>
<td style="text-align:center">1</td>
<td style="text-align:center">1'29.30"</td>
<td style="text-align:center"><a href="/horse/171-ukamaya-verderie">Ukamaya Verderie</a></td>
<td style="text-align:center">6</td>
<td style="text-align:center"><a href="/dzokej?d=71-robert-kieniksman">pow. R. Kieniksman</a></td>
<td style="text-align:center"><a href="/trener?t=6-andrzej&amp;najderski">A. Najderski</a></td>
<td style="text-align:center">7 000 zł</td>
</tr>
...
</table>
</div>

It doesn't have a tbody section. But when I look in the browser's inspector, I can see one.

Why doesn't table have a tbody?

Here's the view in the element inspector:

[1]

ggorlen
CezarySzulc
  • @ggorlen no, it doesn't. It's not JavaScript-generated content. – CezarySzulc Aug 23 '21 at 15:08
  • Your browser parses the raw HTML and adds a `<tbody>` that you can see in the element inspector (basically, makes an effort to make the HTML valid and better adhere to standards), while `html.parser` doesn't. – ggorlen Aug 23 '21 at 15:09
  • @ggorlen is there a way to 'fix' the HTML so that it has standard content? – CezarySzulc Aug 23 '21 at 15:16
  • I modified your sentence "I checked source of webiste" to "I looked in the browser's inspector" because these are totally different things. The second one deals with the way the site looks after the browser manipulates it, whereas checking the source (`view-source:https://koniewyscigowe.pl/wyscig?w=14222-tor-partynice-nagroda-cheval-francais`) shows the `<tbody>` is missing. The `"lxml"` parser usually does a better job of cleaning up the HTML than the default `"html.parser"` does, but I believe you're pretty much tied down to its behavior, and you can't guarantee it'll be identical to Chrome. – ggorlen Aug 23 '21 at 15:19
  • If your goal is to get the `<tr>` content from the table, just skip the `.tbody` property in your code. That element isn't in the source and you can't guarantee all parsers will add it on your behalf, although _technically_ every `<table>` should have a `<tbody>` to wrap its table content. – ggorlen Aug 23 '21 at 15:23
  • [Why do browsers insert tbody element into table elements?](https://stackoverflow.com/questions/938083/why-do-browsers-insert-tbody-element-into-table-elements) seems like the correct dupe target. See also [Why do browsers still inject `<tbody>` in HTML5?](https://stackoverflow.com/questions/7490364/why-do-browsers-still-inject-tbody-in-html5) and [Is it necessary to have `<tbody>` in every table?](https://stackoverflow.com/questions/3078099/is-it-necessary-to-have-tbody-in-every-table) – ggorlen Aug 23 '21 at 15:25
  • @ggorlen as you suggested, I swapped the default `html.parser` for `lxml`, but that didn't help. Then I switched to `html5lib` and it works fine! Thanks for your help! – CezarySzulc Aug 23 '21 at 15:28
  • No problem, but I strongly suggest skipping the `.tbody` call anyway, even if you find a parser that injects it. Feel free to add a [self answer](https://stackoverflow.com/help/self-answer) since you found a solution. – ggorlen Aug 23 '21 at 15:29
  • Ok, I will skip this. – CezarySzulc Aug 23 '21 at 15:30
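
The parser behaviour discussed in the comments above can be checked locally without fetching the page. This is only a rough sketch: `SAMPLE_HTML` is a shortened, hypothetical stand-in for the page's markup, and `parser_adds_tbody` is a helper name invented for this illustration. Parsers that are not installed are reported as `None` rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, shortened stand-in for the page's markup: a table with no <tbody>.
SAMPLE_HTML = """
<div class="table-responsive">
<table class="table table-striped table-bordered">
<thead><tr><th>rekord</th><th>koń</th></tr></thead>
<tr><td>1'29.30"</td><td>Ukamaya Verderie</td></tr>
</table>
</div>
"""


def parser_adds_tbody(parser_name):
    """Return True/False for an installed parser, or None if it is not installed."""
    try:
        soup = BeautifulSoup(SAMPLE_HTML, parser_name)
    except FeatureNotFound:
        # Beautiful Soup raises this when the requested parser is unavailable.
        return None
    return soup.find("table").tbody is not None


for name in ("html.parser", "lxml", "html5lib"):
    print(name, parser_adds_tbody(name))
```

`html.parser` leaves the markup as written, so it reports `False`, while `html5lib` follows the browser-style HTML5 parsing rules and should report `True` when it is installed.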

1 Answer

@ggorlen suggested using a different parser, because the `<tbody>` I saw in the browser's inspector is added by the browser itself and is not in the page source. After switching to html5lib, the code works. In other words, this parser repairs the page's HTML by adding the missing elements automatically. The recommendation is still to skip the `.tbody` call, even if you find a parser that injects it.

import requests
from bs4 import BeautifulSoup  # the html5lib package must be installed (pip install html5lib)

web_content = requests.get('https://koniewyscigowe.pl/wyscig?w=14222-tor-partynice-nagroda-cheval-francais')
# html5lib parses the way a browser does, so it inserts the missing <tbody>
soup = BeautifulSoup(web_content.text, "html5lib")
for index, table in enumerate(soup.find_all('div', {'class': 'table-responsive'})):
    if index == 0:
        pass
    elif index == 1:
        for starts_stats in table.tbody.find_all('tr'):
            print('HERE WE ARE')
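
For completeness, here is a minimal sketch of the "skip `.tbody`" variant recommended in the comments. It runs on a shortened, hypothetical copy of the table (`SAMPLE_HTML`) instead of the live page, and `body_rows` is a helper name made up for this example. Because `find_all('tr')` also returns the header row inside `<thead>`, the helper keeps only rows that contain `<td>` cells.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened copy of the second table from the question (no <tbody>).
SAMPLE_HTML = """
<div class="table-responsive">
<table class="table table-striped table-bordered">
<thead><tr><th></th><th>rekord</th><th>koń</th></tr></thead>
<tr>
<td>1</td>
<td>1'29.30"</td>
<td><a href="/horse/171-ukamaya-verderie">Ukamaya Verderie</a></td>
</tr>
</table>
</div>
"""


def body_rows(table):
    """Return the rows that hold <td> cells, whether or not a <tbody> exists."""
    return [tr for tr in table.find_all("tr") if tr.find("td")]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("div", class_="table-responsive").table
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in body_rows(table)
]
print(rows)
```

Since this never touches `.tbody`, it should behave the same way with any of the parsers, whether or not they inject the element.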
CezarySzulc