Trying to web scrape all the tables on a web page

Question

import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import requests
from time import sleep
from random import randint
import re

towns = pd.DataFrame()

url = f"https://www.city-data.com/city/Adak-Alaska.html"
page = requests.get(url).text
doc = BeautifulSoup(page, "html.parser")

table_data = doc.findAll("td")
#for i in table_data:
   #towns.append(table_data[i])
print(table_data)

I'm trying to get the data from the tables, like numbers of adherents to certain religions, ethnic groups, etc. When I look at the source page all that stuff is between the td tags but I'm not seeing it when I print out table_data. What am I doing wrong?

Barry the Platipus · Answer 1 · 2022-07-17T15:22:07.393

import pandas as pd
from bs4 import BeautifulSoup
import numpy as np
import requests
from time import sleep
from random import randint
import re

towns = pd.DataFrame()

url = f"https://www.city-data.com/city/Adak-Alaska.html"
page = requests.get(url).text
doc = BeautifulSoup(page, "html.parser")

dfs = pd.read_html(page)  # list of DataFrames, one per table found on the page
for x in dfs:
    print(x)  # do what you will with the data

For instance, the religions would be table 17 (dfs[17]):

                 Religion  Adherents Congregations
0                Orthodox        754             6
1  Evangelical Protestant        232             3
2                Catholic        185             1
3                   Other        112             1
4     Mainline Protestant         82             1
5                    None       4196             -
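
A position such as 17 depends on how many tables the page happens to contain, so it may shift if the page layout changes. As a hedged alternative, the sketch below picks tables by column name instead. The helper `tables_with_column` is a hypothetical name, and the list `dfs` here is stand-in data so the example runs offline; in practice it would be the list returned by `pd.read_html(page)`.

```python
import pandas as pd


def tables_with_column(dfs, column):
    """Return the DataFrames in `dfs` that have a column named `column`."""
    return [df for df in dfs if column in df.columns]


# Stand-in data (not scraped): the first frame is a placeholder, the second
# reuses two rows from the religion table shown above.
dfs = [
    pd.DataFrame({"A": [1, 2], "B": [3, 4]}),
    pd.DataFrame({"Religion": ["Orthodox", "Catholic"], "Adherents": [754, 185]}),
]

religion_tables = tables_with_column(dfs, "Religion")
print(religion_tables[0])
```

`pd.read_html` also accepts a `match` argument (a string or regex) that limits the result to tables containing matching text, which may be enough on its own for simple cases.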

EDIT: Given the OP's unresolved issues with their Python install, a workaround that relies only on BeautifulSoup's built-in html.parser would be:

url = "https://www.city-data.com/city/Adak-Alaska.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
for x in soup.select('table'):
    for z in x.select('tr'):
        print([y.text.strip() for y in z.find_all(['td', 'th'])])
    print('________________')

The results can then be transformed into DataFrames.
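
A minimal sketch of that last step, assuming the first row of each table is its header (real pages may not follow this). The function name `tables_to_dataframes` and the inline `sample_html` are illustrative only; the sample reuses two rows from the religion table above so the code runs without a network request.

```python
import pandas as pd
from bs4 import BeautifulSoup


def tables_to_dataframes(html):
    """Parse every <table> in `html` into a DataFrame using html.parser only."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for table in soup.select("table"):
        rows = [
            [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
            for tr in table.select("tr")
        ]
        rows = [r for r in rows if r]  # drop rows that contain no cells
        if not rows:
            continue
        # Assumption: the first row holds the column names.
        header, body = rows[0], rows[1:]
        # Keep only body rows whose width matches the header.
        body = [r for r in body if len(r) == len(header)]
        frames.append(pd.DataFrame(body, columns=header))
    return frames


sample_html = """
<table>
  <tr><th>Religion</th><th>Adherents</th><th>Congregations</th></tr>
  <tr><td>Orthodox</td><td>754</td><td>6</td></tr>
  <tr><td>Catholic</td><td>185</td><td>1</td></tr>
</table>
"""

frames = tables_to_dataframes(sample_html)
print(frames[0])
```

All cell values come back as strings, so numeric columns would still need a conversion such as `pd.to_numeric`. For the real page, pass `requests.get(url).text` in place of `sample_html`.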

edited Jul 17 '22 at 15:22

answered Jul 17 '22 at 13:53

Barry the Platipus


I just copied your code and it gave me an error. – Virtual Adept Jul 17 '22 at 14:07
What is the exact error you get? The code works. – Barry the Platipus Jul 17 '22 at 14:09
https://imgur.com/a/PsxfLWF – Virtual Adept Jul 17 '22 at 14:11
The standalone code I wrote above works. You screenshotted a different script. Open a new .py file and run it in an environment with the above packages installed - it will work. – Barry the Platipus Jul 17 '22 at 14:15
Sorry wrong screenshot – Virtual Adept Jul 17 '22 at 14:22
https://imgur.com/a/6Nt4atK – Virtual Adept Jul 17 '22 at 14:22
As per your error, you do not have lxml installed. You can install it with `pip install lxml` – Barry the Platipus Jul 17 '22 at 14:26
made a new project just to make sure, same errors https://imgur.com/a/gCT5F86 – Virtual Adept Jul 17 '22 at 14:31
I've already installed lxml twice – Virtual Adept Jul 17 '22 at 14:31
For your issue with lxml please see https://stackoverflow.com/questions/44954802/python-importerror-lxml-not-found-please-install-it – Barry the Platipus Jul 17 '22 at 14:33
Yeah I don't know, I followed all their suggestions and it is still not working – Virtual Adept Jul 17 '22 at 14:37
Did you also do a `pip install -U pandas` ? – Barry the Platipus Jul 17 '22 at 14:42
I reinstalled it, just to be safe, and it still doesn't work. – Virtual Adept Jul 17 '22 at 14:45
The Python module `lxml` is only a wrapper around the C library `libxml2`, so you may have to install that library manually. See the lxml docs: [installation](https://lxml.de/installation.html) – furas Jul 17 '22 at 16:17
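
When a package is reported missing after it was installed, one possible cause is that pip installed it for a different interpreter than the one running the script. The thread does not confirm that this was the case here. The sketch below, using only the standard library, reports which interpreter is running and whether it can locate a given module; `describe_module` is a hypothetical helper name.

```python
import importlib.util
import sys


def describe_module(name):
    """Report which interpreter is running and whether it can find `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


print(describe_module("lxml"))
```

If `found` is False, running `python -m pip install lxml` with the interpreter shown under `interpreter` installs the package for that same interpreter.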
