I've just written a little scraper using BeautifulSoup in a Jupyter notebook. I wanted to get certain links from a web page, loop through those links to extract articles, and save those articles as .txt files using the link texts as their names. After some trial and error I got the scraper working, but there is something I don't understand.
from bs4 import BeautifulSoup as bs
import requests
url = 'https://chhouk-krohom.com/'
response = requests.get(url)
soup = bs(response.content, 'html.parser')
contents = soup.select('p[style="text-align:justify;"] a')
for content in contents:
    part = content.text
    link = content['href']
    for l in link:
        s = bs(requests.get(link).content, 'html.parser')
        main = s.article.text
        file_name = part  # I don't understand here
        with open('./{}.txt'.format(file_name), mode='wt', encoding='utf-8') as file:
            file.write(str(main))
As you can see, there are two loops here. The first loop gets the link texts (part) and the links (link). The second loop follows each link and extracts the article (main). The file_name assignment sits inside the second loop, yet it uses the value (part) set in the first loop. Even so, each article still ends up paired with the right link text (why?). As a result, I got .txt files named after their links, as intended.
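To narrow down what puzzles me, here is a minimal standalone sketch (not the scraper itself; link and part are made-up stand-ins) of what the inner for l in link: loop actually does: iterating over a string yields its characters one by one, while the body keeps reading the unchanged outer-loop variables.

    # Stand-ins for one href and its link text from the outer loop
    link = 'https://example.com/a'
    part = 'Example title'

    iterations = 0
    for l in link:            # l is a single character and is never used below
        file_name = part      # still the value set before this loop started
        iterations += 1

    print(iterations)         # one pass per character of the URL
    print(file_name)          # unchanged: 'Example title'

So if I read this right, the body just repeats once per character with the same part and link each time.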
Please also correct my code if you see room for improvement. Thank you very much for your time.