
I am trying to learn how to pull data from this url: https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview

However, the problem is that the URL doesn't change when I switch pages, so I am not sure how to enumerate or loop through them. I'm looking for a better approach, since the page has about 3,000 sales records.

Here is my starting code. It is very simple, but I would appreciate any help or hints. I think I might need to switch to another package, but I am not sure which one. Maybe BeautifulSoup?

import requests
import pandas as pd

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"

html = requests.get(url).content
df_list = pd.read_html(html, header=1)[0]
df_list = df_list.drop([0, 1, 2])  # drop unnecessary rows
James Ho
  • Does this answer your question? [Scrape a dynamic website](https://stackoverflow.com/questions/206855/scrape-a-dynamic-website) – gre_gor Aug 31 '22 at 21:39

1 Answer


The page loads new results via a POST request, so you can get the data from more pages by posting the form data with a different pageNum each time:

import requests
import pandas as pd
from bs4 import BeautifulSoup


data = {
    "folder": "auctionResults",
    "loginID": "00",
    "pageNum": "1",
    "orderBy": "AdvNum",
    "orderDir": "asc",
    "justFirstCertOnGroups": "1",
    "doSearch": "true",
    "itemIDList": "",
    "itemSetIDList": "",
    "interest": "",
    "premium": "",
    "itemSetDID": "",
}

url = "https://denver.coloradotaxsale.com/index.cfm?folder=auctionResults&mode=preview"


all_data = []

for data["pageNum"] in range(1, 3):  # <-- increase number of pages here.
    soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    for row in soup.select("#searchResults tr")[2:]:
        tds = [td.text.strip() for td in row.select("td")]
        all_data.append(tds)

columns = [
    "SEQ NUM",
    "Tax Year",
    "Notices",
    "Parcel ID",
    "Face Amount",
    "Winning Bid",
    "Sold To",
]

df = pd.DataFrame(all_data, columns=columns)

# print last 10 items from dataframe:
print(df.tail(10).to_markdown())

Prints:

|     | SEQ NUM | Tax Year | Notices | Parcel ID        | Face Amount | Winning Bid | Sold To  |
|----:|:--------|---------:|:--------|:-----------------|:------------|:------------|:---------|
|  96 | 000094  |     2020 |         | 00031-18-001-000 | $905.98     | $81.00      | 00005517 |
|  97 | 000095  |     2020 |         | 00031-18-002-000 | $750.13     | $75.00      | 00005517 |
|  98 | 000096  |     2020 |         | 00031-18-003-000 | $750.13     | $75.00      | 00005517 |
|  99 | 000097  |     2020 |         | 00031-18-004-000 | $750.13     | $75.00      | 00005517 |
| 100 | 000098  |     2020 |         | 00031-18-007-000 | $750.13     | $76.00      | 00005517 |
| 101 | 000099  |     2020 |         | 00031-18-008-000 | $905.98     | $84.00      | 00005517 |
| 102 | 000100  |     2020 |         | 00031-19-001-000 | $1,999.83   | $171.00     | 00005517 |
| 103 | 000101  |     2020 |         | 00031-19-004-000 | $1,486.49   | $131.00     | 00005517 |
| 104 | 000102  |     2020 |         | 00031-19-006-000 | $1,063.44   | $96.00      | 00005517 |
| 105 | 000103  |     2020 |         | 00031-20-001-000 | $1,468.47   | $126.00     | 00005517 |
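Since the site has roughly 3,000 rows, instead of hardcoding `range(1, 3)` you can keep requesting pages until one comes back empty. Here is a minimal, self-contained sketch of that loop; `fetch_rows` is a hypothetical stand-in (my own name, not from the site) for the `requests.post` + BeautifulSoup parsing step above:

```python
def fetch_rows(page_num):
    # Stand-in for the real request: pretend the site has 2 pages
    # of 3 rows each, and returns an empty list past the last page.
    pages = {
        1: [["000001", "2020", "", "00031-18-001-000", "$905.98", "$81.00", "00005517"]] * 3,
        2: [["000004", "2020", "", "00031-18-002-000", "$750.13", "$75.00", "00005517"]] * 3,
    }
    return pages.get(page_num, [])


def scrape_all():
    all_data = []
    page = 1
    while True:
        rows = fetch_rows(page)
        if not rows:  # an empty page means there are no more results
            break
        all_data.extend(rows)
        page += 1
    return all_data


rows = scrape_all()
print(len(rows))  # 6 rows across the 2 stand-in pages
```

On the live site, `fetch_rows` would set `data["pageNum"] = page_num`, POST to the URL, and return the list of `<td>` texts parsed from `#searchResults`, exactly as in the loop in the answer above.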
Andrej Kesely