Tablescraping from a website with ID using beautifulsoup

Question

I'm having a problem scraping the table on this website. I should be getting the headings, but instead I get:

AttributeError: 'NoneType' object has no attribute 'tbody'

I'm a bit new to web scraping, so any help would be great.

import requests
from bs4 import BeautifulSoup

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()

page = s.get(URL)
soup = BeautifulSoup(page.content, "lxml")

table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")

headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)

Oleksii Tambovtsev · Answer 1 · 2021-12-29T16:04:08.210

If you look at page.content, you will see that "Your IP address has been blocked".

You should add some headers to your request because the website is blocking your request. In your specific case, it will be enough to add a User-Agent:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()

page = s.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")

table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")

headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)

Even with the headers added, you will still get an error, this time on this line:

headings.append(td.b.text.replace('\n', ' ').strip())

You should change it to

headings.append(td.text.replace('\n', ' ').strip())

because a td does not always contain a b element, in which case td.b is None.
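To illustrate the point about the missing b element, here is a minimal sketch that runs against a small inline table instead of the live site. The sample HTML is hypothetical (only the table id is taken from the question), and it uses the built-in "html.parser" so it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical header row: only some cells wrap their label in <b>.
SAMPLE_HTML = """
<table id="propertysearchresults">
  <tr>
    <td><b>Property ID</b></td>
    <td>Owner Name</td>
    <td><b>Property
        Address</b></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="propertysearchresults")
header_row = table.find("tr")

# Read the text of each td directly, so a missing <b> does not matter.
# " ".join(text.split()) collapses newlines and repeated spaces.
headings = [" ".join(td.get_text().split()) for td in header_row.find_all("td")]

print(headings)
```

Reading the text from td itself works whether or not the label is wrapped in b, which is why it avoids the second error.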

Thank you! I just noticed that after the code was only working on another device — Achilles, Dec 29 '21 at 22:56

score 2 · Answer 2 · answered Dec 29 '21 at 15:58

import requests
from bs4 import BeautifulSoup

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
page = s.get(URL,headers=headers)
soup = BeautifulSoup(page.content, "lxml")

Finding Table Data:

column_data=soup.find("table").find_all("tr")[0]
column=[i.get_text() for i in column_data.find_all("td") if i.get_text()!=""]

row=soup.find("table").find_all("tr")[1:]
main_lst=[]
for row_details in row:
    lst=[]
    for i in row_details.find_all("td")[1:]:
        if i.get_text()!="":
            lst.append(i.get_text())
    main_lst.append(lst)

Converting to pandas DataFrame:

import pandas as pd
df=pd.DataFrame(main_lst,columns=column)

Output:

Property ID↓ Geographic ID ↓    Owner Name  Property Address    Legal Description   2021 Market Value
0   2709013R-10644-00H-0010-1   PARTHASARATHY SURESH & ANITHA HARIKRISHNAN  12209 Willowgate DrFrisco, TX  75035    Ridgeview At Panther Creek Phase 2, Blk H, Lot 1    $513,019
.....

HedgeHog · Accepted Answer · 2021-12-29T16:23:01.580

What happens?

Note: Always look at your soup first - therein lies the truth. The content can always be slightly to extremely different from the view in the dev tools.

Access Revoked

Your IP address has been blocked.

We detected irregular, bot-like usage of our Property Search originating from your IP address. This block was instated to reduce stress on our webserver, to ensure that we're providing optimal site performance to the taxpayers of Collin County.

We have not blocked your ability to download our data exports, which you can still use to acquire bulk property data.
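A guard along these lines would have surfaced the block message instead of the AttributeError. This is only a sketch: find_results_table is a hypothetical helper, the inline HTML is a shortened stand-in for the block page, and "html.parser" is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the block page the script received.
BLOCKED_HTML = """
<html><body>
<h1>Access Revoked</h1>
<p>Your IP address has been blocked.</p>
</body></html>
"""


def find_results_table(html, table_id="propertysearchresults"):
    """Return the results table, or raise with a preview of the page text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # soup.find returned None: show what the page actually says.
        preview = soup.get_text(" ", strip=True)[:200]
        raise ValueError(f"table #{table_id} not found; page says: {preview!r}")
    return table


try:
    find_results_table(BLOCKED_HTML)
except ValueError as exc:
    message = str(exc)
    print(message)
```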

How to fix?

Add a user-agent to your requests so that it looks like you are requesting with a "browser".

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
page = s.get(URL,headers=headers)
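Since a Session is already in use, one possible variation is to set the header once on the session so that every later request carries it. The sketch below only prepares a request and does not send it, so the header can be inspected without touching the site. The shortened URL is an assumption for illustration.

```python
import requests

# Same Chrome User-Agent string as in the snippet above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
)

s = requests.Session()
# Session-level headers are merged into every request made through s.
s.headers.update({"User-Agent": USER_AGENT})

# Prepare (but do not send) a request to see which headers would go out.
# The URL here is shortened for illustration.
request = requests.Request("GET", "https://www.collincad.org/propertysearch")
prepared = s.prepare_request(request)

print(prepared.headers["User-Agent"])
```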

Alternatively, just download the data exports mentioned in the block message.

Example (scraping table)

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()

page = s.get(URL,headers=headers)
soup = BeautifulSoup(page.content, "lxml")

data = []
for row in soup.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])

pd.DataFrame(data[1:], columns=data[0])

Output

	Property ID ↓ Geographic ID ↓	Owner Name	Property Address	Legal Description	2021 Market Value
1	2709013 R-10644-00H-0010-1	PARTHASARATHY SURESH & ANITHA HARIKRISHNAN	12209 Willowgate Dr Frisco, TX\xa0 75035	Ridgeview At Panther Creek Phase 2, Blk H, Lot 1	$513,019
2	2709018 R-10644-00H-0020-1	JOSHI PRASHANT & SHWETA PANT	12235 Willowgate Dr Frisco, TX\xa0 75035	Ridgeview At Panther Creek Phase 2, Blk H, Lot 2	$546,254
3	2709019 R-10644-00H-0030-1	THALLAPUREDDY RAVENDRA & UMA MAHESWARI VEMULA	12261 Willowgate Dr Frisco, TX\xa0 75035	Ridgeview At Panther Creek Phase 2, Blk H, Lot 3	$550,768
4	2709020 R-10644-00H-0040-1	KULKARNI BHEEMSEN T & GOURI R	12287 Willowgate Dr Frisco, TX\xa0 75035	Ridgeview At Panther Creek Phase 2, Blk H, Lot 4	$509,593
5	2709021 R-10644-00H-0050-1	BALAM GANESH & SHANTHIREKHA LOKULA	12313 Willowgate Dr Frisco, TX\xa0 75035	Ridgeview At Panther Creek Phase 2, Blk H, Lot 5	$553,949

...
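The addresses in the output still contain a non-breaking space (\xa0). If that gets in the way, one option is to normalize whitespace in each cell before building the DataFrame. normalize_cell is a hypothetical helper, not part of BeautifulSoup or pandas.

```python
def normalize_cell(text):
    """Collapse all runs of whitespace, including \\xa0, into single spaces."""
    # str.split() with no arguments splits on any Unicode whitespace.
    return " ".join(text.split())


cell = "12209 Willowgate Dr Frisco, TX\xa0 75035"
cleaned = normalize_cell(cell)

print(cleaned)
```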

Thank you so much! This fixed my issue. – Achilles Dec 29 '21 at 20:36
