I am trying to get all the image urls for all the books on this page https://www.nb.co.za/en/books/0-6-years
with beautiful soup.
This is my code:
from bs4 import BeautifulSoup
import requests
baseurl = "https://www.nb.co.za/"
productlinks = []
r = requests.get(f'https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")
def my_filter(tag):
return (tag.name == 'a' and
tag.parent.name == 'div' and
'img-container' in tag.parent['class'])
for item in productlist:
for link in item.find_all(my_filter, href=True):
productlinks.append(baseurl + link['href'])
cover = soup.find_all('div', class_="img-container")
print(cover)
And this is my output:
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
What I hope to get:
https://www.nb.co.za/en/helper/ReadImage/25929.jpg
My problem is:
How do I get the data-sourcre only?
How to I get the extension of the image?