I am trying to crawl several links, extract the text found in <p>
HTML tags, and write the output to separate files, one file per link. So far:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import csv
import pyperclip
import pprint
import requests

urls = ['https://link1',
        'https://link2']
url_list = list(urls)

# scrape elements
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find_all('p')
    page = soup.getText()

for line in urls:
    with open('filename{}.txt'.format(line), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
I am getting OSError: [Errno 22] Invalid argument: 'filenamehttps://link1', presumably because the slashes in the URL are not valid filename characters.
If I change my code to this:
for index, line in enumerate(urls):
    with open('filename{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
The script runs, but I have a semantic error: both output files contain the text extracted from link2. I guess the separate second for-loop causes this, since page only holds the last page scraped by the first loop.
I've searched Stack Overflow for similar answers but I can't figure it out.
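I think the fix might be to fetch and write inside the same loop iteration, so each file gets the text of its own page. A rough, untested sketch of what I mean (the URLs are still placeholders, and the helper names are just mine):

```python
import requests
from bs4 import BeautifulSoup

def paragraph_text(html):
    """Return the non-empty <p> texts of a page, one per line."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text() for p in soup.find_all('p')]
    return '\n'.join(line for line in paragraphs if line.strip())

def scrape_to_files(urls):
    """Fetch each URL and write its <p> text to its own numbered file."""
    for index, url in enumerate(urls):
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        # index-based name avoids the invalid '/' characters of the raw URL
        with open('filename{}.txt'.format(index), 'w', encoding='utf8') as outfile:
            outfile.write(paragraph_text(response.content))

# scrape_to_files(['https://link1', 'https://link2'])  # placeholder links
```

Is a single combined loop like this the right way to go, or is there a better pattern?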