
I got this code to almost work, despite much ignorance. Please help on the home run!

  • Problem 1: INPUT:

I have a long list of URLs (1000+) in a single column of a .csv file. I would rather read them from that file than paste them into the code, as below.

  • Problem 2: OUTPUT:

The source pages actually have 3 drivers and 3 challenges each. In a separate Python file, the code below finds, prints and saves all 3, but not when I use the dataframe approach shown here (it only saves 2).

  • Problem 3: OUTPUT:

I want the output (both files) to have URLs in column 0, and then drivers (or challenges) in the following columns. But what I've written here (probably the 'drop') not only drops one entry per URL but also shifts each URL's values 2 columns further across.

At the end I'm showing both the inputs and the current & desired output. Sorry for the long question. I'll be very grateful for any help!

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
dataframes = []
dataframes2 = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data = []
        for x in toc.select('li:-soup-contains-own("Market drivers") li'):
            data.append(x.get_text(strip=True))
        df = pd.DataFrame(data, columns=[url])
        dataframes.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes)
        tdata = df2.T
        tdata.to_csv(f'detail-dr.csv', header=True)

    get_drivers()


    def get_challenges():
        data = []
        for y in toc.select('li:-soup-contains-own("Market challenges") li'):
            data.append(y.get_text(strip=True).replace('Table Impact of drivers and challenges', ''))
        df = pd.DataFrame(data, columns=[url])
        dataframes2.append(pd.DataFrame(df).drop(0, axis=0))
        df2 = pd.concat(dataframes2)
        tdata = df2.T
        tdata.to_csv(f'detail-ch.csv', header=True)

    get_challenges()

The inputs look like this in each URL. They are just lists:

Market drivers

  • Growing investment in fabs
  • Miniaturization of electronic products
  • Increasing demand for IoT devices

Market challenges

  • Rapid technological changes in semiconductor industry
  • Volatility in semiconductor industry
  • Impact of technology chasm Table Impact of drivers and challenges

My desired output for drivers is:

0 | 1 | 2 | 3
http/.../Global-Induction-Hobs-30196623/ | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices

But instead I get:

0 1 2 3 4 5 6
http/.../Global-Induction-Hobs-30196623/ Increasing demand for convenient home appliances with changes in lifestyle patterns Growing adoption of energy-efficient appliances
http/.../Global-Human-Capital-Management-30196628/ Increasing demand for unified solutions for all HR functions Increasing workforce diversity
http/.../Global-Probe-Card-30196643/ Miniaturization of electronic products Increasing demand for IoT devices
Michael Wiz
  • Are there always exactly 3 drivers and 3 challenges (no more and no less)? – QHarr Nov 26 '21 at 20:32
  • No, in the examples above there are 3, but it may be anything from 0 to about 7 – Michael Wiz Nov 27 '21 at 07:29
  • And how should they be handled? Should each driver have its own column such that if the same driver occurs for >= 2 different requests, they are mapped to the same column? Or does it not matter and you simply add another column each time a new driver is encountered? – QHarr Nov 27 '21 at 11:16
  • Thanks @QHarr. Each row is independent of all other rows, so the columns just get filled with drivers from the left (after the url). If there are 0 drivers, then only the url appears on the left and no columns get filled. If there are 2 drivers, then the next 2 columns get filled. If there are 7, then 7 columns get filled. Etc. – Michael Wiz Nov 27 '21 at 19:46

1 Answer


Store the scraped data in a list of dicts and create a data frame from it. Then split each list of drivers / challenges into separate columns and concat those columns to the url and type columns to form the final data frame.

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

urls = ['https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/', 'https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/']
data = []

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    toc = soup.find("div", id="toc")

    def get_drivers():
        data.append({
            'url':url,
            'type':'driver',
            'list':[x.get_text(strip=True) for x in toc.select('li:-soup-contains-own("Market drivers") li')]
        })

    get_drivers()


    def get_challenges():
        data.append({
            'url':url,
            'type':'challenges',
            'list':[x.text.replace('Table Impact of drivers and challenges','') for x in toc.select('li:-soup-contains-own("Market challenges") ul li') if x.text != 'Table Impact of drivers and challenges']
        })

    get_challenges()

    
pd.concat([pd.DataFrame(data)[['url','type']], pd.DataFrame(pd.DataFrame(data).list.tolist())],axis = 1)#.to_csv(sep='|')

Output

url | type | 0 | 1 | 2
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | driver | Product innovations and new designs | Increasing demand for convenient home appliances with changes in lifestyle patterns | Growing adoption of energy-efficient appliances
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/ | challenges | High cost limiting the adoption in the mass segment | Health hazards related to induction hobs | Limitation of using only flat - surface utensils and induction-specific cookware
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | driver | Demand for automated recruitment processes | Increasing demand for unified solutions for all HR functions | Increasing workforce diversity
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Human-Capital-Management-30196628/ | challenges | Threat from open-source software | High implementation and maintenance cost | Threat to data security
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | driver | Growing investment in fabs | Miniaturization of electronic products | Increasing demand for IoT devices
https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/ | challenges | Rapid technological changes in semiconductor industry | Volatility in semiconductor industry | Impact of technology chasm
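
The reshaping step can be tried without scraping anything. Below is a minimal sketch, assuming made-up records with uneven list lengths (the example.com URLs and the item names are placeholders, not real data). It also shows one possible way to get a drivers-only table with the URL in the first column, which is closer to the two-file layout asked for in the question.

```python
import pandas as pd

# Made-up records standing in for the scraped results (hypothetical values)
records = [
    {'url': 'https://example.com/a', 'type': 'driver', 'list': ['d1', 'd2', 'd3']},
    {'url': 'https://example.com/a', 'type': 'challenges', 'list': ['c1']},
    {'url': 'https://example.com/b', 'type': 'driver', 'list': []},
]

df = pd.DataFrame(records)

# Each inner list becomes one row; shorter rows are padded with missing values
items = pd.DataFrame(df['list'].tolist(), index=df.index)

# Put url/type on the left and the expanded items on the right
wide = pd.concat([df[['url', 'type']], items], axis=1)

# Optional: keep only the drivers and drop the type column
drivers = wide[wide['type'] == 'driver'].drop(columns='type')

print(wide)
print(drivers)
```

Because the number of item columns is taken from the longest list, a URL with no drivers still gets a row, with only the url column filled.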
HedgeHog
  • Nicely done there. Does this work for uneven numbers of columns, i.e. cases with different numbers of drivers/challenges? – QHarr Nov 27 '21 at 20:41
  • HedgeHog, thank you very much for this!!! It definitely solves the output issues. And I suppose having both drivers and challenges in one file rather than two is just as good, or perhaps even better. However, I still have that first question about reading the urls from a csv file. I think I know the basic structure of 'with open', but I'm having trouble linking that to the rest of the code. Your help (or @QHarr's) would be highly appreciated. – Michael Wiz Nov 27 '21 at 21:22
  • @QHarr: Thanks. As far as I can tell it also works with different numbers of drivers and challenges; that is handled by the split and concat. Rows with fewer items get 'None' in the unused columns. – HedgeHog Nov 27 '21 at 21:40
  • @MichaelWiz: Happy to help. Concerning reading urls from csv, please ask a new question. Thanks – HedgeHog Nov 27 '21 at 21:41
  • Thank you! Just posted it. In the meantime I have tested your code and it works perfectly. (I also tested for @QHarr 's example). Your code is so much more efficient than my messy attempt! Plus I think I understand 80% of what you've done here - the last 20% will take a while for me to study, though. https://stackoverflow.com/questions/70139037/reading-list-of-urls-from-csv-for-scraping-with-python-beautifulsoup-pandas – Michael Wiz Nov 27 '21 at 22:02
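
For the CSV input raised in Problem 1, here is a minimal sketch using only the standard library. It assumes the file has one URL per row in the first column and no header row. The helper name `read_urls` and the file name `urls.csv` are hypothetical, and an in-memory string stands in for the real file so the snippet runs on its own.

```python
import csv
import io


def read_urls(handle):
    """Return the first-column value of every non-empty row."""
    return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


# Stand-in for: open('urls.csv', newline='')  (hypothetical file name)
sample = io.StringIO(
    "https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Induction-Hobs-30196623/\n"
    "\n"
    "https://www.marketresearch.com/Infiniti-Research-Limited-v2680/Global-Probe-Card-30196643/\n"
)

urls = read_urls(sample)
print(len(urls))
```

With a real file, something like `with open('urls.csv', newline='') as f: urls = read_urls(f)` should produce the same kind of list, which can then replace the hard-coded `urls` in the scraping loop.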