
I've just written a little scraper using BeautifulSoup and a Jupyter notebook. I wanted to get certain links from a web page, loop through those links to extract the articles, and save each article as a .txt file using the link text as its name. After some trial and error, I got the scraper working, but there is something I don't understand.

from bs4 import BeautifulSoup as bs
import requests

url = 'https://chhouk-krohom.com/'
response = requests.get(url)
soup = bs(response.content, 'html.parser')

contents = soup.select('p[style="text-align:justify;"] a')
for content in contents:
    part = content.text
    link = content['href']
    
    for l in link:
        s = bs(requests.get(link).content, 'html.parser')
        main = s.article.text
        file_name = part # I don't understand here
        with open('./{}.txt'.format(file_name), mode='wt', encoding='utf-8') as file:
            file.write(str(main))

As you can see, there are two loops here. The first loop gets the link texts (part) and the links (link). The second loop is supposed to follow each link and extract the article (main). The file_name assignment is inside the second loop, but it uses the value (part) from the first loop. Even so, the link texts still end up matching the articles saved in the second loop (why?). As a result, I got .txt files named after their links, as intended.

Please also correct my code if it needs further improvements. Thank you very much for your time.

  • Assuming by second loop you meant the inner loop: nothing ever modifies `part`, so it remains unchanged the entire time. I am not sure why you would think it gets magically changed to something else. When in doubt, run your code with a debugger, step through each execution step, and inspect every suspect variable to clear things up for yourself. – metatoaster Jun 18 '22 at 09:00
  • Thank you very much. Now I just got ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). I got some txt files saved. I want to run the scraper again. Will it overwrite the existing files or skip them? If it will overwrite the existing files, is there a way to make it skip the existing files? – Mortus Pect Jun 18 '22 at 09:18
  • If you searched for a solution you may come across [this thread](https://stackoverflow.com/questions/54634571/create-new-files-dont-overwrite-existing-files-in-python). – metatoaster Jun 18 '22 at 09:22
  • There seems to be no solution for this. Either overwrite or append. – Mortus Pect Jun 18 '22 at 09:48
  • My problem was that I wanted to skip writing the same files again. The names were assigned based on the links. There were 110 .txt files to write in total (in the for loop). Due to the connection error, only around 40 .txt files were scraped and written. I was looking for a way to skip writing the existing files and continue with the rest. It was something to do with the loop, I guess. None of the answers I've read seemed to apply in my case. – Mortus Pect Jun 19 '22 at 04:51
  • Then wouldn't it work to just have a condition where you check for the existence of the target file using [`os.path.exists`](https://docs.python.org/3/library/os.path.html#os.path.exists) before attempting to open it for writing (a minimal sketch is shown after these comments)? You need to figure out the exact logic you want, using the tools you have. – metatoaster Jun 19 '22 at 06:15
  • Thanks again for your help. After I added headers, the scraping was a success (albeit starting all over again). For future disconnection problems, though, I'll dig into what you've suggested. – Mortus Pect Jun 19 '22 at 06:28
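
Following up on the `os.path.exists` suggestion in the comments, here is a minimal sketch of that check. It reuses the file_name and main variables from the inner loop in the question; exactly where the check sits in your loop is an assumption for illustration, not something tested against the site.

import os  # put this with the other imports at the top of the script

file_path = './{}.txt'.format(file_name)
if os.path.exists(file_path):
    continue  # this article was already saved in an earlier run; skip to the next link
with open(file_path, mode='wt', encoding='utf-8') as file:
    file.write(str(main))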

1 Answer


As suggested, you should run this with a debugger, or add some print statements in your code so you can see what is happening at each line/part of the code.

If you do that, you will see that when the code runs, link = content['href'] stores the full URL, e.g. 'https://chhouk-krohom.com/%E1%9E%98%E1%9E%A0%E1%9E%B6%E1%9E%9C%E1%9E%B7%E1%9E%97%E1%9E%84%E1%9F%92%E1%9E%82%E1%9F%A1/', in link.

You are iterating over a string with for l in link:, so l takes the characters of the URL one at a time: 'h', then 't', then 't', 'p', 's', ':', '/', '/', 'c', 'h', and so on. However, the loop body never actually uses l; it always calls requests.get(link) with the full URL. That is why the output still looks correct: part and link are unchanged inside the inner loop, so the same article is simply fetched, parsed, and written to the same file once for every character in the URL. The result matches the link text, but each page is downloaded dozens of times for nothing.
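
A quick way to convince yourself of the string iteration is to print the loop variable (the URL below is just the site's homepage, used for illustration):

link = 'https://chhouk-krohom.com/'
for l in link:
    print(l)  # prints 'h', 't', 't', 'p', 's', ':', '/', '/', 'c', ... one character per line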

What you want to do is first get the contents. From each content, find all the <a> tags that have an href (those are the links). Then iterate through that list of links.

from bs4 import BeautifulSoup as bs
import requests

url = 'https://chhouk-krohom.com/'
response = requests.get(url)
soup = bs(response.content, 'html.parser')

contents = soup.select('p[style="text-align:justify;"]')
for content in contents:
    links = content.find_all('a', href=True)
    
    for link in links:
        part = link.text
        url_link = link['href']
    
        s = bs(requests.get(url_link).content, 'html.parser')
        main = s.article.text
        file_name = part  # the link text becomes the file name
        with open('./{}.txt'.format(file_name), mode='wt', encoding='utf-8') as file:
            file.write(str(main))
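
As a side note, since you mentioned in the comments that adding headers fixed the ConnectionError, one common approach is to reuse a single requests.Session with a browser-like User-Agent for every request. The header value below is an assumption for illustration, not something from this thread; substitute whatever worked for you.

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # generic desktop-style User-Agent (assumed)
response = session.get(url)  # then use session.get(...) wherever the code uses requests.get(...)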
chitown88
  • Thank you very much for your input. One more thing: how do I use .find() or .find_all() like a CSS selector? Some sites have weird structures, so I need to include many tags in the navigation tree. For example, I can use a CSS selector like .select('article div p.title span'), but I have no idea how to do that with .find(); I've only seen it used with a single tag and attribute (a rough equivalent is sketched after these comments). – Mortus Pect Jun 19 '22 at 05:03
  • Read [here](https://stackoverflow.com/questions/38028384/beautifulsoup-difference-between-find-and-select) – chitown88 Jun 19 '22 at 07:45
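
For the follow-up question about nested selectors: .find() and .find_all() only take one tag (plus attributes) at a time, so there is no direct equivalent of a full selector string, but you can chain the lookups. A rough sketch of something close to .select('article div p.title span'), using the tag and class names from the comment purely for illustration (note that this naive chain can return duplicates if the divs are nested):

spans = []
for article in soup.find_all('article'):
    for div in article.find_all('div'):
        for p in div.find_all('p', class_='title'):
            spans.extend(p.find_all('span'))  # descendant <span> tags, like the CSS selector matches

In practice .select() is usually simpler for multi-level selectors like this, since it handles the descendant matching in one call.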