
This was part of another question (Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas), which was generously answered by @HedgeHog and contributed to by @QHarr. I'm now posting this part as a separate question.

In the code below, I'm pasting 3 example source URLs into the code and it works. But I have a long list of URLs (1000+) to scrape, stored in the first column of a .csv file (let's call it 'urls.csv'). I would prefer to read them directly from that file.

I think I know the basic structure of 'with open' (e.g. the way @bguest answered it below), but I'm having trouble linking it to the rest of the code so that everything continues to work. How can I replace the hard-coded list of URLs with iterative reading of the .csv, so that the URLs are passed correctly into the code?

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")


    def get_drivers():
        data.append({
            'url': url,
            'type': 'driver',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })


    get_drivers()


    def get_challenges():
        data.append({
            'url': url,
            'type': 'challenges',
            'list': [x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if
                     'Table Impact of drivers and challenges' not in x.get_text(strip=True)]
        })


    get_challenges()

pd.concat([pd.DataFrame(data)[['url', 'type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],
          axis=1).to_csv('output.csv')
Michael Wiz
  • Do you mean `with open(...`? – MattDMo Nov 27 '21 at 22:01
  • Please [edit] your question and post exactly what your specific question is, as a single interrogatory statement. Do not rely on information in another question, as questions on Stack Overflow must be self-contained. Include the link to the other question *only for reference*. Once you have stated your question, please post a [mre] in code clearly demonstrating your problem, and what you've tried to do so far to solve it. If your question is just *How do I read a list of URLs from a CSV?*, then please search some more for the answer, as that has been addressed many many times on this site. – MattDMo Nov 27 '21 at 22:07
  • @MattDMo, thank you for your notes. Indeed there is no need to read another question for this, but I wanted to recognize the role of others in getting to this place - the quoted code is clearly not all mine. I think showing what I've attempted so far will only confuse, because it clearly doesn't work - I'm missing a step in linking the reading of the csv with the remainder, and showing two incompatible pieces of code won't help. Sorry if my understanding of the rules is very 'newbie'. – Michael Wiz Nov 27 '21 at 22:57

1 Answer


Since you're using pandas, read_csv will do the trick for you: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
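For example, a minimal sketch assuming your file has a header row named `url` with the URLs in the first column (the snippet writes a tiny demo file first so it runs standalone; skip that step for your real file):

```python
import pandas as pd

# Demo only: create a small urls.csv with the assumed layout
# (a header row named "url", one URL per line in the first column)
with open('urls.csv', 'w') as f:
    f.write('url\nhttps://www.google.com\nhttps://www.facebook.com\n')

# Read the first column into a plain Python list
urls = pd.read_csv('urls.csv')['url'].tolist()

# If your file has no header row, read the first column by position instead:
# urls = pd.read_csv('urls.csv', header=None)[0].tolist()
```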

If you want to write it yourself, you could use the built-in csv library:

import csv

with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row["url"])

Edit: I was asked how to make the rest of the code use the URLs from the csv.

First, put the URLs in a urls.csv file:

url
https://www.google.com
https://www.facebook.com

Now gather the URLs from the csv:

import csv

with open('urls.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)

    urls = [row["url"] for row in reader]

# remove the following lines
urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/',
        'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']

Now the URLs from the csv will be used by the rest of the code.
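To show the wiring end to end, here's a sketch of how the pieces connect (the scraping body is replaced by a placeholder comment since it's already in the question, and the demo writes a tiny urls.csv first so it runs standalone): the csv reading comes first, `data = []` is initialized once before the loop, and the final `pd.concat(...).to_csv(...)` stays unindented, outside the loop.

```python
import csv

# Demo only: create a small urls.csv (skip this if your file already exists)
with open('urls.csv', 'w', newline='') as f:
    f.write('url\nhttps://www.google.com\nhttps://www.facebook.com\n')

# Gather the URLs from the first column (assumes a header row named "url")
with open('urls.csv', newline='') as csvfile:
    urls = [row['url'] for row in csv.DictReader(csvfile)]

data = []  # initialize once, before the loop, so results from all URLs accumulate

for url in urls:
    # here goes the body from the question: requests.get(url), BeautifulSoup
    # parsing, get_drivers() and get_challenges(), each appending dicts to data
    data.append({'url': url, 'type': 'demo', 'list': []})  # placeholder

# the pd.concat(...).to_csv('output.csv') call stays here, outside the loop
```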

bguest
  • thank you very much for this. I guess I need to import csv. Now, my problem is actually how to use what this is reading in the rest of the code. I'm not printing it at this stage, but passing the urls, one by one, to the rest of the code. And my problem is mainly not knowing how to get the rest of the code to use the urls and work with them. Would you have a suggestion for me? – Michael Wiz Nov 27 '21 at 22:38
  • answering your question in my answer, give me a few minutes – bguest Nov 27 '21 at 22:41
  • thanks so much for this. I still had two little conundrums after implementing your code: where exactly data=[] should go, and whether pd.concat at the end should start with an indent or not. Well, I tried each of about 6 combinations and I see that it works perfectly with data=[] coming just before "for url in urls:" and with no indent in the last line (but one more indent in all of my other code). So thank you very much! What I was really missing was this part "urls = [row["url"] for row in reader]" - basically a translation between the csv input and the rest of the code! – Michael Wiz Nov 27 '21 at 23:34
  • data=[] should go inside the beginning of the for loop if you are keeping the code the way it is, so that you aren't adding data you've already added before – bguest Nov 27 '21 at 23:38
  • hi @bguest and thanks for the answer. I have a similar issue with this code https://pastecode.io/s/hno20ip0 , with this csv file https://pastecode.io/s/nkaazee9 . I'm getting this error `urls = [row["url"] for row in reader] KeyError: 'url'` . Any idea of what's wrong with the KeyError? The original code (with the hardcoded urls) is here: https://stackoverflow.com/a/71115244/10789707 Thanks. – Lod Feb 15 '22 at 13:00
  • Found the `KeyError: 'url'` cause (UTF-8 with BOM encoding) from this answer https://stackoverflow.com/a/34399309/10789707 . Solution details at https://stackoverflow.com/questions/71141395/why-keyerror-url-occur-while-reading-urls-list-from-a-csv-file-with-python/71141444#71141444 – Lod Feb 16 '22 at 12:03
  • Glad you found your issue :) Quite busy these days so I don't see notifications often @Lod – bguest Feb 16 '22 at 19:48
  • Thanks again a lot for your answer @bguest! Be well! – Lod Feb 16 '22 at 20:05