And a Selenium version:
from io import StringIO

from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time

driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
url = 'https://www.privateequityinternational.com/database/#/pei-300'
driver.get(url)  # load the page
time.sleep(5)  # give the JavaScript-rendered table time to appear
page = driver.page_source  # grab the fully rendered HTML
driver.quit()  # shut the browser down

soup = BeautifulSoup(page, 'html.parser')  # parse the rendered HTML
table = soup.select_one('table.au-target.pux--responsive-table')  # the ranking table
dfs = pd.read_html(StringIO(table.prettify()))  # StringIO avoids the literal-HTML deprecation in newer pandas
df = pd.concat(dfs)
df.to_csv('file.csv')  # save the data to file.csv
print(df.head(25))
prints:
Ranking Name City, Country (HQ)
0 1 Blackstone New York, United States
1 2 The Carlyle Group Washington DC, United States
2 3 KKR New York, United States
3 4 TPG Fort Worth, United States
4 5 Warburg Pincus New York, United States
5 6 NB Alternatives New York, United States
6 7 CVC Capital Partners Luxembourg, Luxembourg
7 8 EQT Stockholm, Sweden
8 9 Advent International Boston, United States
9 10 Vista Equity Partners Austin, United States
10 11 Leonard Green & Partners Los Angeles, United States
11 12 Cinven London, United Kingdom
12 13 Bain Capital Boston, United States
13 14 Apollo Global Management New York, United States
14 15 Thoma Bravo San Francisco, United States
15 16 Insight Partners New York, United States
16 17 BlackRock New York, United States
17 18 General Atlantic New York, United States
18 19 Permira Advisers London, United Kingdom
19 20 Brookfield Asset Management Toronto, Canada
20 21 EnCap Investments Houston, United States
21 22 Francisco Partners San Francisco, United States
22 23 Platinum Equity Beverly Hills, United States
23 24 Hillhouse Capital Group Hong Kong, Hong Kong
24 25 Partners Group Baar-Zug, Switzerland
The data is also saved to file.csv.

Note: you need Selenium and geckodriver installed, and in this code the geckodriver executable is expected at c:/program/geckodriver.exe.
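The parsing half of the script (BeautifulSoup + pandas.read_html) can be tried without a browser at all. This is a minimal sketch using a made-up stand-in for driver.page_source, with the same table class as the real page; the two rows of data are illustrative only:

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Stand-in for driver.page_source: a tiny HTML document carrying the same
# table classes that the snippet above selects on the real page.
page = """
<html><body>
<table class="au-target pux--responsive-table">
  <tr><th>Ranking</th><th>Name</th><th>City, Country (HQ)</th></tr>
  <tr><td>1</td><td>Blackstone</td><td>New York, United States</td></tr>
  <tr><td>2</td><td>The Carlyle Group</td><td>Washington DC, United States</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(page, 'html.parser')
# select_one takes a CSS selector; both classes must be present on the table.
table = soup.select_one('table.au-target.pux--responsive-table')

# Newer pandas versions expect a file-like object rather than a raw HTML
# string, hence the StringIO wrapper.
df = pd.concat(pd.read_html(StringIO(table.prettify())))
df.to_csv('file.csv', index=False)
print(df.head())
```

Once this runs on the stand-in HTML, swapping page for the real driver.page_source gives the full table.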