This was part of another question (Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas), which was generously answered by @HedgeHog with contributions from @QHarr. I'm now posting this part as a separate question.
In the code below, I'm pasting 3 example source URLs into the code and it works. But I have a long list of URLs (1000+) to scrape, and they are stored in the first column of a .csv file (let's call it 'urls.csv'). I would prefer to read directly from that file.
I think I know the basic structure of 'with open' (e.g. the way @bguest answered it below), but I'm having trouble linking that to the rest of the code so that the rest continues to work. How can I replace the hardcoded list of URLs with iterative reading of the .csv, so that the URLs are passed correctly into the code?
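For reference, this is roughly the 'with open' pattern I have in mind for reading the first column (a sketch only; it assumes 'urls.csv' has one URL per row in the first column and no header row — the sample file is written here just to make the snippet self-contained):

```python
import csv

# Illustration only: create a small sample 'urls.csv' (in practice the file already exists)
with open('urls.csv', 'w', newline='') as f:
    csv.writer(f).writerows([
        ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/'],
        ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/'],
    ])

# Read the first column into a list, skipping any blank rows
with open('urls.csv', newline='') as f:
    urls = [row[0] for row in csv.reader(f) if row]

print(urls)
```

My question is how to connect a list built this way to the scraping loop below.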
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []
for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })
    get_drivers()

    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if
                     'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
        })
    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],
          axis=1).to_csv('output.csv')