I am trying to crawl several links, extract the text found in <p>
HTML tags, and write the output to separate files, one file per link. So far:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import csv
import pyperclip
import pprint
import requests

urls = ['https://link1',
        'https://link2']
url_list = list(urls)

# scrape elements
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find_all('p')
    page = soup.getText()

for line in urls:
    with open('filename{}.txt'.format(line), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
I am getting OSError: [Errno 22] Invalid argument: 'filenamehttps://link1', presumably because the slashes in the URL are not valid filename characters.
If I change my code to this:
for index, line in enumerate(urls):
    with open('filename{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
The script runs, but I have a semantic error: both output files contain the text extracted from link2. I guess the separate second for-loop causes this, since page only holds the last page scraped by the first loop.
I've searched Stack Overflow for similar answers but I can't figure it out.
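I think the fix might be to fetch and write inside the same loop iteration, so each file gets the text of its own page. A rough, untested sketch of what I mean (the URLs are still placeholders, and the helper names are just mine):

```python
import requests
from bs4 import BeautifulSoup

def paragraph_text(html):
    """Return the non-empty <p> texts of a page, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text() for p in soup.find_all('p')]
    return '\n'.join(line for line in paragraphs if line.strip())

def scrape_to_files(urls):
    """Fetch each URL and write its <p> text to its own numbered file."""
    for index, url in enumerate(urls):
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        # index-based name avoids the invalid '/' characters of the raw URL
        with open('filename{}.txt'.format(index), 'w', encoding='utf8') as outfile:
            outfile.write(paragraph_text(response.content))

# scrape_to_files(['https://link1', 'https://link2'])  # placeholder links
```

Is a single combined loop like this the right way to go, or is there a better pattern?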