Reading URLs from .csv and appending scrape results below previous with Python, BeautifulSoup, Pandas

Question

I got this code to almost work, despite much ignorance. Please help on the home run!

Problem 1: INPUT:

I have a long list of URLs (1000+) stored in a single column of a .csv file. I would prefer to read them from that file rather than paste them into the code, as I do below.

Problem 2: OUTPUT:

The source pages actually have 3 drivers and 3 challenges each. In a separate Python file, the code below finds, prints and saves all 3, but not when I use the dataframe approach shown here (it only saves 2).

Problem 3: OUTPUT:

I want the output (both files) to have URLs in column 0, and then drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') makes them not only drop one row but also move across 2 columns.

At the end I'm showing both the inputs and the current & desired output. Sorry for the long question. I'll be very grateful for any help!

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()


    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()

The inputs look like this in each URL. They are just lists:

Market drivers

Growing investment in fabs
Miniaturization of electronic products
Increasing demand for IoT devices

Market challenges

Rapid technological changes in semiconductor industry
Volatility in semiconductor industry
Impact of technology chasm Table Impact of drivers and challenges

My desired output for drivers is:

0	1	2	3
http/.../Global-Induction-Hobs-30196623/	Product innovations and new designs	Increasing demand for convenient home appliances with changes in lifestyle patterns	Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/	Demand for automated recruitment processes	Increasing demand for unified solutions for all HR functions	Increasing workforce diversity
http/.../Global-Probe-Card-30196643/	Growing investment in fabs	Miniaturization of electronic products	Increasing demand for IoT devices

But instead I get:

0	1	2	3	4	5	6
http/.../Global-Induction-Hobs-30196623/	Increasing demand for convenient home appliances with changes in lifestyle patterns	Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/			Increasing demand for unified solutions for all HR functions	Increasing workforce diversity
http/.../Global-Probe-Card-30196643/					Miniaturization of electronic products	Increasing demand for IoT devices

Are there always only 3 drivers and 3 challenges (no more and no less) ? — QHarr, Nov 26 '21 at 20:32
No, in the examples above there are 3, but it may be anything from 0 to about 7 — Michael Wiz, Nov 27 '21 at 07:29
And how should they be handled? Should each driver have its own column such that if the same driver occurs for >= 2 different requests, they are mapped to the same column? Or does it not matter and you simply add another column each time a new driver is encountered? — QHarr, Nov 27 '21 at 11:16
Thanks @QHarr. Each row is independent of all other rows, so the columns just get filled with drivers from the left (after the url). If there are 0 drivers, then just the url on the left and no columns get filled. If there are 2 drivers, then the consecutive 2 columns get filled. If there are 7, then 7 columns get filled. Etc. — Michael Wiz, Nov 27 '21 at 19:46

HedgeHog · Accepted Answer · 2021-11-27T21:36:49.300

Store the results in a list of dicts (one dict per url and type) and create a DataFrame from it. Then split each list of drivers / challenges into separate columns and concat those columns with the url and type columns to get the final DataFrame.
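The split-and-concat step can be tried offline before running the scraper. This is only a minimal sketch: the rows are made-up placeholders (the example.com URLs and the 'd1', 'd2', 'd3' items are assumptions, not scraped data), chosen so that the lists have different lengths, including an empty one.

```python
import pandas as pd

# Made-up sample rows standing in for scraped results (no network access needed).
data = [
    {'url': 'https://example.com/a', 'type': 'driver', 'list': ['d1', 'd2', 'd3']},
    {'url': 'https://example.com/b', 'type': 'driver', 'list': ['d1']},
    {'url': 'https://example.com/c', 'type': 'driver', 'list': []},
]

df = pd.DataFrame(data)

# Expand each list into its own columns; shorter lists are padded with missing values.
items = pd.DataFrame(df['list'].tolist(), index=df.index)

# Keep url/type on the left and place the expanded items to the right.
result = pd.concat([df[['url', 'type']], items], axis=1)
print(result)
```

Each row is filled from the left, so a url with no items keeps only its url and type, and the item columns stay empty.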

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url':url,
            'type':'driver',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()


    def get_challenges():
        data.append({
            'url':url,
            'type':'challenges',
            'list':[x.text.replace('Table Impact of drivers and challenges','') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })

    get_challenges()

    
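# Build the frame once, expand each list into its own columns, and join them to url/type.
df = pd.DataFrame(data)
result = pd.concat([df[['url', 'type']], pd.DataFrame(df['list'].tolist())], axis=1)
print(result)  # or result.to_csv(sep='|') for pipe-separated text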

Output

url	type	0	1	2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/	driver	Product innovations and new designs	Increasing demand for convenient home appliances with changes in lifestyle patterns	Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/	challenges	High cost limiting the adoption in the mass segment	Health hazards related to induction hobs	Limitation of using only flat - surface utensils and induction-specific cookware
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/	driver	Demand for automated recruitment processes	Increasing demand for unified solutions for all HR functions	Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/	challenges	Threat from open-source software	High implementation and maintenance cost	Threat to data security
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/	driver	Growing investment in fabs	Miniaturization of electronic products	Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/	challenges	Rapid technological changes in semiconductor industry	Volatility in semiconductor industry	Impact of technology chasm
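The hard-coded urls list could also be replaced by reading the single-column .csv mentioned in the question. The following is a rough sketch, not a tested part of the answer: the helper name read_urls is hypothetical, and it assumes the file has no header row and holds one URL per line in its first column.

```python
import pandas as pd

def read_urls(csv_path):
    """Return the URLs found in the first column of a header-less CSV file."""
    # header=None: treat the first line as data rather than column names (assumption).
    col = pd.read_csv(csv_path, header=None, usecols=[0]).iloc[:, 0]
    # Drop empty cells and strip stray whitespace around each URL.
    return col.dropna().astype(str).str.strip().tolist()
```

With a file such as urls.csv (the file name is an assumption), `urls = read_urls('urls.csv')` would take the place of the literal list, and the loop stays unchanged. If the file does have a header row, drop `header=None` so the first line is not treated as a URL.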

Nicely done there. Does this work for uneven numbers of columns, i.e. cases with different numbers of drivers/challenges? — QHarr, Nov 27 '21 at 20:41
HedgeHog, thank you very much for this!!! It definitely solves the output issues. And I suppose having both drivers and challenges in one file rather than two is just as good or perhaps even better. However, I still have that first question about reading the urls from a csv file. I think I know the basic structure of 'with open', but I'm having problems linking that to the rest of the code. Your help (or @QHarr's) would be highly appreciated. — Michael Wiz, Nov 27 '21 at 21:22
@QHarr: Thanks, in my opinion it also works with different numbers of drivers and challenges; that is handled by the split and concat. Rows with fewer items get 'None' in the remaining columns. — HedgeHog, Nov 27 '21 at 21:40
@MichaelWiz: Happy to help. Concerning reading urls from csv, please ask a new question. Thanks — HedgeHog, Nov 27 '21 at 21:41
Thank you! Just posted it. In the meantime I have tested your code and it works perfectly. (I also tested for @QHarr 's example). Your code is so much more efficient than my messy attempt! Plus I think I understand 80% of what you've done here - the last 20% will take a while for me to study, though. https://stackoverflow.com/questions/70139037/reading-list-of-urls-from-csv-for-scraping-with-python-beautifulsoup-pandas — Michael Wiz, Nov 27 '21 at 22:02
