
I'm having a problem scraping the table on this website. I should be getting the headings, but instead I get:

AttributeError: 'NoneType' object has no attribute 'tbody'

I'm a bit new to web scraping, so any help would be great.

import requests
from bs4 import BeautifulSoup

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()

page = s.get(URL)
soup = BeautifulSoup(page.content, "lxml")

table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")

headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)
Oleksii Tambovtsev
Achilles

3 Answers


If you look at page.content, you will see the message "Your IP address has been blocked".
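A quick way to confirm this from code, rather than by eye, is to check the response body for the block notice before parsing it. The sketch below is only illustrative: `looks_blocked` and `BLOCK_MARKER` are hypothetical names, and the marker string is the message this site returned at the time, so it may need adjusting.

```python
# Hypothetical helper: detect the block page before handing it to BeautifulSoup.
BLOCK_MARKER = "Your IP address has been blocked"  # text observed in page.content


def looks_blocked(body: bytes) -> bool:
    """Return True if the raw response body contains the block notice."""
    return BLOCK_MARKER.encode("utf-8") in body


# Usage with the question's session (not executed here):
#     page = s.get(URL)
#     if looks_blocked(page.content):
#         raise RuntimeError("Request was blocked; try sending a User-Agent header")
```

Failing early like this gives a clearer error than the AttributeError raised later by table.tbody.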

You should add some headers to your request because the website is blocking your request. In your specific case, it will be enough to add a User-Agent:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()

page = s.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "lxml")

table = soup.find("table", id="propertysearchresults")
table_data = table.tbody.find_all("tr")

headings = []
for td in table_data[0].find_all("td"):
    headings.append(td.b.text.replace('\n', ' ').strip())

print(headings)

Once you add the headers, you will still get an error, but this time on the line:

headings.append(td.b.text.replace('\n', ' ').strip())

You should change it to

headings.append(td.text.replace('\n', ' ').strip())

because a td doesn't always contain a b element.
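To see why the original line fails, here is a small sketch on made-up HTML (not the real page markup): one header cell wraps its label in b, the other does not. It uses the built-in "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up header row: the first cell has a <b>, the second does not.
html = "<table><tr><td><b>Owner\nName</b></td><td>Property Address</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

first_cell, second_cell = soup.find_all("td")
print(second_cell.b)  # None, so second_cell.b.text would raise AttributeError

# Reading the text from the td itself works for both kinds of cell.
headings = [td.text.replace("\n", " ").strip() for td in soup.find_all("td")]
print(headings)
```

Reading td.text directly returns the label whether or not it is wrapped in another tag.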

Oleksii Tambovtsev
import requests
from bs4 import BeautifulSoup

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
page = s.get(URL,headers=headers)
soup = BeautifulSoup(page.content, "lxml")

Finding Table Data:

column_data=soup.find("table").find_all("tr")[0]
column=[i.get_text() for i in column_data.find_all("td") if i.get_text()!=""]

row=soup.find("table").find_all("tr")[1:]
main_lst=[]
for row_details in row:
    lst=[]
    for i in row_details.find_all("td")[1:]:
        if i.get_text()!="":
            lst.append(i.get_text())
    main_lst.append(lst)

Converting to pandas DataFrame:

import pandas as pd
df=pd.DataFrame(main_lst,columns=column)

Output:

   Property ID↓  Geographic ID ↓  Owner Name  Property Address  Legal Description  2021 Market Value
0  2709013  R-10644-00H-0010-1  PARTHASARATHY SURESH & ANITHA HARIKRISHNAN  12209 Willowgate DrFrisco, TX  75035  Ridgeview At Panther Creek Phase 2, Blk H, Lot 1  $513,019
.....
Bhavya Parikh

What happens?

Note: Always look at your soup first; therein lies the truth. The content can differ slightly or drastically from what you see in the browser's dev tools.

Access Revoked

Your IP address has been blocked.

We detected irregular, bot-like usage of our Property Search originating from your IP address. This block was instated to reduce stress on our webserver, to ensure that we're providing optimal site performance to the taxpayers of Collin County.

We have not blocked your ability to download our data exports, which you can still use to acquire bulk property data.
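Because this blocked response contains no table at all, soup.find(...) returns None, and that is where the AttributeError comes from. A minimal sketch of a guard that reports what the page actually said is shown below; `find_results_table` is a hypothetical helper, and the sample HTML is a shortened stand-in for the block page.

```python
from bs4 import BeautifulSoup


def find_results_table(html, table_id="propertysearchresults"):
    """Return the results table, or raise an error that shows what the page said."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        # Include the start of the visible text so the real cause is obvious.
        preview = soup.get_text(" ", strip=True)[:80]
        raise ValueError(f"table #{table_id} not found; page starts with: {preview!r}")
    return table


# Shortened stand-in for the block page quoted above.
blocked = "<html><body><h1>Access Revoked</h1><p>Your IP address has been blocked.</p></body></html>"
try:
    find_results_table(blocked)
except ValueError as err:
    print(err)
```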

How to fix?

Add a user-agent to your request so that it looks like you are requesting with a "browser".

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
page = s.get(URL,headers=headers)

Or, as an alternative, just download the results.
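If you stay with scraping and send several requests through the same session, the header can also be set once on the session instead of being passed to every call. This is a sketch that assumes the same browser string as above is acceptable to the site; `USER_AGENT` is just an illustrative name.

```python
import requests

# Assumption: this browser string (copied from the snippet above) is accepted by the site.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
)

s = requests.Session()
s.headers.update({"User-Agent": USER_AGENT})  # sent with every request made via s

# Later calls no longer need headers=... (not executed here):
#     page = s.get(URL)
```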

Example (scraping table)

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}

URL = "https://www.collincad.org/propertysearch?situs_street=Willowgate&situs_street_suffix" \
      "=&isd%5B%5D=any&city%5B%5D=any&prop_type%5B%5D=R&prop_type%5B%5D=P&prop_type%5B%5D=MH&active%5B%5D=1&year=2021&sort=G&page_number=1"

s = requests.Session()

page = s.get(URL,headers=headers)
soup = BeautifulSoup(page.content, "lxml")

data = []
for row in soup.select('#propertysearchresults tr'):
    data.append([c.get_text(' ',strip=True) for c in row.select('td')])

pd.DataFrame(data[1:], columns=data[0])

Output

Property ID ↓ Geographic ID ↓ Owner Name Property Address Legal Description 2021 Market Value
1 2709013 R-10644-00H-0010-1 PARTHASARATHY SURESH & ANITHA HARIKRISHNAN 12209 Willowgate Dr Frisco, TX\xa0 75035 Ridgeview At Panther Creek Phase 2, Blk H, Lot 1 $513,019
2 2709018 R-10644-00H-0020-1 JOSHI PRASHANT & SHWETA PANT 12235 Willowgate Dr Frisco, TX\xa0 75035 Ridgeview At Panther Creek Phase 2, Blk H, Lot 2 $546,254
3 2709019 R-10644-00H-0030-1 THALLAPUREDDY RAVENDRA & UMA MAHESWARI VEMULA 12261 Willowgate Dr Frisco, TX\xa0 75035 Ridgeview At Panther Creek Phase 2, Blk H, Lot 3 $550,768
4 2709020 R-10644-00H-0040-1 KULKARNI BHEEMSEN T & GOURI R 12287 Willowgate Dr Frisco, TX\xa0 75035 Ridgeview At Panther Creek Phase 2, Blk H, Lot 4 $509,593
5 2709021 R-10644-00H-0050-1 BALAM GANESH & SHANTHIREKHA LOKULA 12313 Willowgate Dr Frisco, TX\xa0 75035 Ridgeview At Panther Creek Phase 2, Blk H, Lot 5 $553,949

...
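The \xa0 in the address column is a non-breaking space (U+00A0) that comes through from the page's HTML. If it gets in the way, one possible clean-up is sketched below; `clean_cell` is a hypothetical helper, not part of the answer's code.

```python
def clean_cell(text: str) -> str:
    """Replace non-breaking spaces and collapse runs of whitespace."""
    return " ".join(text.replace("\xa0", " ").split())


print(clean_cell("12209 Willowgate Dr Frisco, TX\xa0 75035"))
# 12209 Willowgate Dr Frisco, TX 75035
```

It could be applied to each cell inside the list comprehension that builds data, before the DataFrame is created.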

HedgeHog