
I am scraping the following webcomic site with requests and bs4 to download the comic image: www.qwantz.com

In the browser inspector when I select the webcomic element and copy the CSS Selector, I get the following:

comicElem = soup.select('body > center > table > tbody > tr > td:nth-child(2) > img')

Looking at the html for the site, this makes sense. The elements in that section are aligned as such:

...
  <tr>
    <td>...</td>
    <td>...</td>
    <td>...</td>
  </tr>
...

However, this selector returns an empty list. When I shorten the selector to end at `... > td`, I get the three sibling `td` elements in my selection.

The following all result in empty lists as well, for each numerical argument I tried (1 and 2):

comicElem = soup.select('body > center > table > tbody > tr > td:nth-child(1)')
comicElem = soup.select('body > center > table > tbody > tr > td')[1]

Using comicElem = soup.select('body > center > table > tbody > tr > td > td > img') gets me the results that I want. But I would like to know what is happening here that fails the CSS selector copied from the web inspector. In short, I would like my code to work using the CSS selector copied from the browser inspector. e.g. with td:nth-child(2).

For reference, here is the relevant code:

#! python3
# scheduledWebComicDL.py - Downloads comics, but first checks whether
# there is an update.

import requests, os, bs4, threading

folderName = 'Web Comics'
os.makedirs(folderName, exist_ok=True) # store comics in folderName

def downloadQwantz():
    # Web comic site to parse.
    site = 'http://www.qwantz.com'

    # Make the soup with requests & bs4.
    res = requests.get(site)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Get image url.
    comicElem = soup.select('body > center > table > tbody > tr > td > td > img')
    # Confirm that a comic image element was found before indexing into
    # the result (the original == 0 check could never match a list).
    if not comicElem:
        print('Could not find comic element at %s.' % (site))
        return

    # Build the full image URL and begin the download.
    comicUrlShort = comicElem[0].get('src')
    comicUrl = site + '/' + comicUrlShort
    checkAndDownload(comicUrl)

downloadQwantz()

Cambuchi
  • Can you provide the full code to reproduce this error? – QHarr Dec 30 '20 at 07:25
  • full code has been provided, along with more specifically what I want to achieve. – Cambuchi Dec 30 '20 at 15:49
  • I've voted to re-open. I think you only need the code relating to downloadQwant() and you can simplify that to only a few lines to demonstrate the problem. It may be bad html tripping up the parser. – QHarr Dec 30 '20 at 19:26
  • Thank you for the feedback. Relatively new to this. – Cambuchi Dec 30 '20 at 22:41

2 Answers


The problem is that the tbody tag shown in the inspector only exists after browser rendering: browsers implicitly insert a missing tbody into tables. It is absent from the page-source view and from what you get back from requests, and I see no custom JS adding it either. So your path copied from the browser, where tbody is present, fails when applied to the soup object, where it is missing.

Examine the level above:

soup.select('body > center > table')

This works, yet there is no tbody in the visible HTML.

With tbody in selector:

soup.select('body > center > table > tbody')

[] i.e. empty list returned

Copied path without tbody:

soup.select('body > center > table > tr > td:nth-child(2) > img')

Tada! A matched node.
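You can reproduce this behaviour offline. A minimal sketch (the table markup below is invented for illustration, not taken from qwantz.com): html.parser does not inject a tbody, so a selector containing tbody matches nothing, while the same path without it matches.

```python
import bs4

# A minimal table written without an explicit <tbody>, the way most
# hand-authored HTML is served.
html = "<table><tr><td>left</td><td><img src='comic.png'></td></tr></table>"
soup = bs4.BeautifulSoup(html, "html.parser")

# html.parser does not add tbody, so this selector matches nothing.
print(soup.select("table > tbody > tr > td"))  # []

# Dropping tbody from the path matches the img as expected.
print(soup.select("table > tr > td:nth-child(2) > img"))
```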


See also:

Why do browsers still inject <tbody> in HTML5?
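If you want to keep pasting selectors straight from the inspector, one workaround is to strip the browser-injected tbody steps before handing the selector to soup.select. strip_tbody below is a hypothetical helper written for this answer, not part of bs4:

```python
def strip_tbody(selector: str) -> str:
    """Remove browser-injected 'tbody' steps from a copied CSS selector.

    Only handles '>'-separated child combinators, which is what the
    inspector's 'Copy selector' produces for this page.
    """
    parts = [p.strip() for p in selector.split(">")]
    return " > ".join(p for p in parts if p != "tbody")

print(strip_tbody("body > center > table > tbody > tr > td:nth-child(2) > img"))
# -> body > center > table > tr > td:nth-child(2) > img
```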

QHarr

All you need is this:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://qwantz.com/").text, "html.parser")
comic_src = f"https://qwantz.com/{soup.select_one('.comic')['src']}"

print(comic_src)

with open(comic_src.rsplit("/")[-1], "wb") as f:
    f.write(requests.get(comic_src).content)

Output:

https://qwantz.com/comics/comic2-2162.png

And a comic image in your local folder.
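A side note on building the image URL: the f-string concatenation above works, but urllib.parse.urljoin copes with relative src values whether or not they carry a leading slash. A small sketch, reusing the src value from the output above:

```python
from urllib.parse import urljoin

site = "https://qwantz.com/"

# urljoin resolves a relative src against the page URL, and also
# handles an absolute path like "/comics/...".
print(urljoin(site, "comics/comic2-2162.png"))   # -> https://qwantz.com/comics/comic2-2162.png
print(urljoin(site, "/comics/comic2-2162.png"))  # -> https://qwantz.com/comics/comic2-2162.png
```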

baduker
  • I know that selecting by class would get me there. However, I am looking to achieve the same results using the CSS selector copied from the web inspector. – Cambuchi Dec 30 '20 at 15:50