
I am scraping the following webcomic site with requests and bs4 to download the comic image: www.qwantz.com

In the browser inspector when I select the webcomic element and copy the CSS Selector, I get the following:

comicElem = soup.select('body > center > table > tbody > tr > td:nth-child(2) > img')

Looking at the html for the site, this makes sense. The elements in that section are aligned as such:

...
  <tr>
    <td>...</td>
    <td>...</td>
    <td>...</td>
  </tr>
...

However, this selector returns an empty list. When I shorten the selector to end at `... > td`, I get the three sibling `td` elements in my selection.

The following all result in empty lists as well, for each numerical argument I tried (1 and 2):

comicElem = soup.select('body > center > table > tbody > tr > td:nth-child(1)')
comicElem = soup.select('body > center > table > tbody > tr > td')[1]

Using comicElem = soup.select('body > center > table > tbody > tr > td > td > img') gets me the results that I want. But I would like to know what is happening here that fails the CSS selector copied from the web inspector. In short, I would like my code to work using the CSS selector copied from the browser inspector. e.g. with td:nth-child(2).

For reference, here is the relevant code:

#! python3
# scheduledWebComicDL.py - Downloads comics, but first checks whether
# there is an update.

import requests, os, bs4, threading

folderName = 'Web Comics'
os.makedirs(folderName, exist_ok=True) # store comics in folderName

def downloadQwantz():
    # Web comic site to parse.
    site = 'http://www.qwantz.com'

    # Make the soup with requests & bs4.
    res = requests.get(site)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Get image url.
    comicElem = soup.select('body > center > table > tbody > tr > td > td > img')
    # Confirm that a comic image element was found before indexing into
    # the result (the original == 0 check could never match a list).
    if not comicElem:
        print('Could not find comic element at %s.' % (site))
        return

    # Build the full image URL and begin the download.
    comicUrlShort = comicElem[0].get('src')
    comicUrl = site + '/' + comicUrlShort
    checkAndDownload(comicUrl)

downloadQwantz()

Cambuchi
  • Can you provide the full code to reproduce this error? – QHarr Dec 30 '20 at 07:25
  • full code has been provided, along with more specifically what I want to achieve. – Cambuchi Dec 30 '20 at 15:49
  • I've voted to re-open. I think you only need the code relating to downloadQwant() and you can simplify that to only a few lines to demonstrate the problem. It may be bad html tripping up the parser. – QHarr Dec 30 '20 at 19:26
  • Thank you for the feedback. Relatively new to this. – Cambuchi Dec 30 '20 at 22:41

2 Answers


The problem is that the tbody tag shown in the inspector only exists after browser rendering: browsers implicitly insert a missing tbody into tables. It is absent from the page-source view and from what you get back from requests, and I see no custom JS adding it either. So your path copied from the browser, where tbody is present, fails when applied to the soup object, where it is missing.

Examine the level above:

soup.select('body > center > table')

This works, yet there is no tbody in the visible HTML.

With tbody in selector:

soup.select('body > center > table > tbody')

[] i.e. empty list returned

Copied path without tbody:

soup.select('body > center > table > tr > td:nth-child(2) > img')

Tada! A matched node.
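You can reproduce this behaviour offline. A minimal sketch (the table markup below is invented for illustration, not taken from qwantz.com): html.parser does not inject a tbody, so a selector containing tbody matches nothing, while the same path without it matches.

```python
import bs4

# A minimal table written without an explicit <tbody>, the way most
# hand-authored HTML is served.
html = "<table><tr><td>left</td><td><img src='comic.png'></td></tr></table>"
soup = bs4.BeautifulSoup(html, "html.parser")

# html.parser does not add tbody, so this selector matches nothing.
print(soup.select("table > tbody > tr > td"))  # []

# Dropping tbody from the path matches the img as expected.
print(soup.select("table > tr > td:nth-child(2) > img"))
```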


See also:

Why do browsers still inject <tbody> in HTML5?
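If you want to keep pasting selectors straight from the inspector, one workaround is to strip the browser-injected tbody steps before handing the selector to soup.select. strip_tbody below is a hypothetical helper written for this answer, not part of bs4:

```python
def strip_tbody(selector: str) -> str:
    """Remove browser-injected 'tbody' steps from a copied CSS selector.

    Only handles '>'-separated child combinators, which is what the
    inspector's 'Copy selector' produces for this page.
    """
    parts = [p.strip() for p in selector.split(">")]
    return " > ".join(p for p in parts if p != "tbody")

print(strip_tbody("body > center > table > tbody > tr > td:nth-child(2) > img"))
# -> body > center > table > tr > td:nth-child(2) > img
```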

QHarr

All you need is this:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://qwantz.com/").text, "html.parser")
comic_src = f"https://qwantz.com/{soup.select_one('.comic')['src']}"

print(comic_src)

with open(comic_src.rsplit("/")[-1], "wb") as f:
    f.write(requests.get(comic_src).content)

Output:

https://qwantz.com/comics/comic2-2162.png

And a comic image in your local folder.
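A side note on building the image URL: the f-string concatenation above works, but urllib.parse.urljoin copes with relative src values whether or not they carry a leading slash. A small sketch, reusing the src value from the output above:

```python
from urllib.parse import urljoin

site = "https://qwantz.com/"

# urljoin resolves a relative src against the page URL, and also
# handles an absolute path like "/comics/...".
print(urljoin(site, "comics/comic2-2162.png"))   # -> https://qwantz.com/comics/comic2-2162.png
print(urljoin(site, "/comics/comic2-2162.png"))  # -> https://qwantz.com/comics/comic2-2162.png
```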

baduker
  • I know that selecting by class would get me there. However, I am looking to achieve the same results using the CSS selector copied from the web inspector. – Cambuchi Dec 30 '20 at 15:50