I am scraping the following webcomic site with requests and bs4 to download the comic image: www.qwantz.com
In the browser inspector when I select the webcomic element and copy the CSS Selector, I get the following:
comicElem = soup.select('body > center > table > tbody > tr > td:nth-child(2) > img')
Looking at the HTML for the site, this makes sense. The elements in that section are arranged as follows:
...
<tr>
    <td>...</td>
    <td>...</td>
    <td>...</td>
</tr>
...
However, this selector returns an empty list. When I back up the selector to just ('... > td'),
I get the three sibling elements in my selector object.
The following also return empty lists, for each index I tried (1 and 2):
comicElem = soup.select('body > center > table > tbody > tr > td:nth-child(1)')
comicElem = soup.select('body > center > table > tbody > tr > td')[1]
Using comicElem = soup.select('body > center > table > tbody > tr > td > td > img')
gets me the results that I want. However, I would like to understand why the CSS selector copied from the web inspector fails. In short, I would like my code to work with the selector copied from the browser inspector, e.g. with td:nth-child(2).
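To compare what BeautifulSoup parsed with what the inspector shows, one option is to print the ancestor chain of the image. The following is only a minimal sketch: the HTML fragment is made up, and its unclosed <td> tags are an assumption for illustration, not the site's actual markup. The helper name ancestor_path is also hypothetical.

```python
import bs4

# Hypothetical fragment; the unclosed <td> tags are an assumption for
# illustration and are not taken from the real page.
html = '<table><tr><td>left<td><img src="comics/example.png"><td>right</table>'
soup = bs4.BeautifulSoup(html, 'html.parser')


def ancestor_path(tag):
    """Return the tag names from the document root down to `tag`."""
    names = [tag.name]
    for parent in tag.parents:
        # Skip the BeautifulSoup object itself, whose name is '[document]'.
        if parent.name != '[document]':
            names.append(parent.name)
    return ' > '.join(reversed(names))


img = soup.find('img')
print(ancestor_path(img))                          # structure as parsed
print(soup.select('tr > td:nth-child(2) > img'))   # inspector-style selector
print(soup.select('tr > td > td > img'))           # nested-td selector
```

With html.parser, which does not implicitly close the open <td> tags in this fragment, the cells end up nested inside each other, so the nth-child selector matches nothing while the nested selector finds the image. Running the same helper on the real page would show whether something similar is happening there.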
For reference, here is the relevant code:
#! python3
# scheduledWebComicDL.py - Downloads comics, but first checks whether
# there is an update.
import os
import threading

import bs4
import requests

folderName = 'Web Comics'
os.makedirs(folderName, exist_ok=True)  # store comics in folderName


def downloadQwantz():
    # Web comic site to parse.
    site = 'http://www.qwantz.com'

    # Make the soup with requests & bs4.
    res = requests.get(site)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Get the comic image element.
    comicElem = soup.select('body > center > table > tbody > tr > td > td > img')

    # Confirm that an img element was found before indexing into the list.
    if not comicElem:
        print('Could not find comic element at %s.' % (site))
        return

    # Build the image url and begin the download.
    comicUrlShort = comicElem[0].get('src')
    comicUrl = site + '/' + comicUrlShort
    checkAndDownload(comicUrl)  # checkAndDownload is not shown in this excerpt


downloadQwantz()