Python is giving me both columns of a table I am scraping, but I only want it to give me one of the columns

Question

I am using Python to scrape the names of the Alaska Supreme Court justices from Ballotpedia (https://ballotpedia.org/Alaska_Supreme_Court). My current code is giving me both the names of the justices as well as the names of the persons in the "Appointed by" column. Here is my current code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

list = ['https://ballotpedia.org/Alaska_Supreme_Court']

temp_dict = {}

for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")]

df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
df.to_csv('18-TEST.csv')

I've been trying to work with this line:

temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")]

I'm a little inexperienced with the browser's inspect tool, so I may be doing the wrong thing when I put "tr" or "td" (which I find under "tbody") after "tablesorter". I'm a bit lost at this point and am having trouble finding resources on this. Could you help me get Python to give me the "Judge" column but not the "Appointed by" column? Thank you!

HedgeHog · Answer 1 · 2021-03-16T18:28:37.063

There are different options to get the result.

Option#1

Slice the list and pick every second element:

[item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")][0::2]

Example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

lst = ['https://ballotpedia.org/Alaska_Supreme_Court']

temp_dict = {}

for page in lst:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter a")][0::2]

pd.DataFrame.from_dict(temp_dict, orient='index').transpose().to_csv('18-TEST.csv', index=False)
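
A small sketch of what the `[0::2]` slice does, and where it can go wrong. The judge names are taken from the table; "Governor A/B/C" are made-up placeholders, and the whole thing assumes the links alternate judge, appointer, judge, appointer:

```python
# Placeholder link texts in page order: judge, appointer, judge, appointer, ...
# The judge names come from the table; "Governor A/B/C" are made-up stand-ins.
link_texts = [
    "Daniel Winfree", "Governor A",
    "Joel Harold Bolger", "Governor B",
    "Peter Jon Maassen", "Governor C",
]

# Start at index 0 and take every second element: the judges.
judges = link_texts[0::2]
print(judges)

# If one "Appointed By" cell has no link, the alternation shifts
# and the slice silently picks up an appointer instead of a judge.
missing_link = ["Daniel Winfree", "Joel Harold Bolger", "Governor B"]
shifted = missing_link[0::2]
print(shifted)
```

So slicing is the quickest fix, but it only holds as long as every row has exactly one link in each column.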

Option#2

Make your selection more specific and select only the first td in a tr:

[item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter tr > td:nth-of-type(1)")]

Example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

lst = ['https://ballotpedia.org/Alaska_Supreme_Court']

temp_dict = {}

for page in lst:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')

    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.wikitable.sortable.jquery-tablesorter tr > td:nth-of-type(1)")]

pd.DataFrame.from_dict(temp_dict, orient='index').transpose().to_csv('18-TEST.csv', index=False)
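
To see what the `td:nth-of-type(1)` selector picks up without hitting the network, here is a minimal sketch on an invented HTML snippet. The markup, the class names and the "Governor A/B" values are assumptions for illustration only and may not match the real page:

```python
from bs4 import BeautifulSoup

# Invented stand-in for the table on the page; the real markup may differ.
html = """
<table class="wikitable sortable">
  <tr><th>Judge</th><th>Appointed By</th></tr>
  <tr><td><a href="/a">Daniel Winfree</a></td><td><a href="/x">Governor A</a></td></tr>
  <tr><td><a href="/b">Susan Carney</a></td><td>Governor B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Header cells are <th>, so only the first <td> of each data row matches.
first_cells = [
    td.get_text(strip=True)
    for td in soup.select("table.wikitable tr > td:nth-of-type(1)")
]
print(first_cells)
```

Unlike the slice in Option#1, this does not depend on every cell containing a link. If the selector returns an empty list on the real page, it may be worth checking the class names in `r.text` rather than in the browser inspector, since the two can differ.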

Option#3

Use pandas' read_html(), which parses the tables on the page into a list of DataFrames, and pick the one you need by index:

Example:

import pandas as pd
df = pd.read_html('https://ballotpedia.org/Alaska_Supreme_Court')[2]
df.Judge.to_csv('18-TEST.csv', index=False)
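
The index `[2]` depends on the order of the tables on the page. As a hedged alternative, `read_html()` also takes a `match` argument that keeps only tables containing the given text. The sketch below runs on an invented HTML string (the markup and the "Governor A/B" values are placeholders) and assumes a parser such as lxml is installed:

```python
from io import StringIO

import pandas as pd

# Invented page with two tables; only the second mentions "Appointed By".
html = """
<table><tr><th>Office</th></tr><tr><td>Clerk</td></tr></table>
<table class="wikitable">
  <tr><th>Judge</th><th>Appointed By</th></tr>
  <tr><td>Daniel Winfree</td><td>Governor A</td></tr>
  <tr><td>Susan Carney</td><td>Governor B</td></tr>
</table>
"""

# match= keeps only the tables whose text matches the string or regex.
tables = pd.read_html(StringIO(html), match="Appointed By")
judges = tables[0]["Judge"]
print(judges.tolist())
```

On the real page you would pass the URL instead of `StringIO(html)`; whether "Appointed By" appears in exactly one table there is an assumption to verify.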

score 1 · Answer 2 · answered Mar 16 '21 at 17:47

Firstly, please note that this is code cannibalised from here.

Now, if you don't know how many rows or columns you have, this gives you a dataframe with all the columns, corresponding to the table on the webpage. Feel free to drop one of the columns if you don't need it.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# I'll do it for the one page example
page = 'https://ballotpedia.org/Alaska_Supreme_Court'

r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')

# this finds the first table with the class specified
table = soup.find('table', attrs={'class':'wikitable sortable jquery-tablesorter'})
# get all rows of the above table
rows = table.find_all('tr')
data = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    # header rows contain only <th> cells, so skip rows that yield no data
    if cols:
        data.append(cols)
# turn it into a pandas dataframe
df = pd.DataFrame(data)
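
Because only the `td` cells are collected, the resulting DataFrame has numeric column labels rather than the table's headers. A minimal sketch of naming the columns and dropping the unwanted one; the rows and the "Governor A/B" values are placeholders, and the column names are assumed to match the table's header:

```python
import pandas as pd

# Placeholder rows in the shape the loop above produces: one list per table row.
data = [
    ["Daniel Winfree", "Governor A"],
    ["Susan Carney", "Governor B"],
]

# Name the columns explicitly, then drop the one that is not needed.
df = pd.DataFrame(data, columns=["Judge", "Appointed By"])
judges_only = df.drop(columns=["Appointed By"])
print(judges_only)
```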

Somitra Gupta · Accepted Answer · 2021-03-16T18:08:10.100


I would like to share another approach to get your table in the desired format:

import pandas as pd
# extracting table and making it dataframe
frame = pd.read_html('https://ballotpedia.org/Alaska_Supreme_Court',attrs={"class":"wikitable sortable jquery-tablesorter"})[0]

# drop unwanted columns
frame.drop("Appointed By", axis=1, inplace=True)

# save dataframe as csv
frame.to_csv("desired/path/output.csv", index=False)

Printing frame would give output as:

| Judge |
|-----|
| Daniel Winfree |
| Joel Harold Bolger |
| Peter Jon Maassen |
| Susan Carney |
| Dario Borghesan |

edited Mar 16 '21 at 18:08

answered Mar 16 '21 at 18:03


Thanks! I am trying to transition what you gave me to a list now, but am having trouble when the csv is only showing the judges from the last element of the list. Here is what I am using:

import pandas as pd
list = ['https://ballotpedia.org/Alaska_Supreme_Court', 'https://ballotpedia.org/Ohio_Supreme_Court',
for page in list:
    frame = pd.read_html(page,attrs={"class":"wikitable sortable jquery-tablesorter"})[0]
    frame.drop("Appointed By", axis=1, inplace=True)
    frame.to_csv("18-TEST.csv", index=False)

– Mark Cicero Mar 16 '21 at 19:36
Thanks for all the help! – Mark Cicero Mar 16 '21 at 19:37
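
On the follow-up in the comment above: writing to the same filename for each page replaces the previous contents, so only the last page survives. One possible approach is to collect a frame per page and write once at the end. This is only a sketch: `read_judge_table` and `collect_judges` are hypothetical helper names, the `Court` column is an addition for illustration, and it assumes each page has a table that `read_html` can find with those attributes.

```python
import pandas as pd


def read_judge_table(page):
    # Same call as in the answer above; needs network access.
    return pd.read_html(page, attrs={"class": "wikitable sortable jquery-tablesorter"})[0]


def collect_judges(pages, read_table=read_judge_table):
    """Read one table per page and stack them into a single DataFrame."""
    frames = []
    for page in pages:
        # errors="ignore" because not every court page may have this column.
        frame = read_table(page).drop(columns=["Appointed By"], errors="ignore")
        # Record which page each row came from.
        frame["Court"] = page.split("/")[-1]
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Usage (requires network access):
# pages = ['https://ballotpedia.org/Alaska_Supreme_Court',
#          'https://ballotpedia.org/Ohio_Supreme_Court']
# collect_judges(pages).to_csv("18-TEST.csv", index=False)
```

Passing the reader in as `read_table` is just a convenience so the loop can be tried on hand-made frames before pointing it at the real pages.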
