Why can't I scrape all the data from Zillow?

Question

I'm trying to scrape price data from Zillow as practice with Python, but I'm not getting all of it.

This is my code:

from jobEntryBot import JobEntryBot
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from pprint import pprint
import time
import requests
URL_ZILLOW = r"https://www.zillow.com/homes/for_rent/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22mapBounds%22%3A%7B%22west%22%3A-123.4663871665039%2C%22east%22%3A-121.7744926352539%2C%22south%22%3A37.03952097286371%2C%22north%22%3A38.19687379258651%7D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22price%22%3A%7B%22max%22%3A872627%7D%2C%22beds%22%3A%7B%22min%22%3A1%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%2C%22mp%22%3A%7B%22max%22%3A3000%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22fr%22%3A%7B%22value%22%3Atrue%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A9%7D"

header = {
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/511.22 (KHTML, like Gecko) Chrome/139.3.3.3 Safari/312.311',
    'Accept-Language': 'en-US,en;q=0.9'
}
data = requests.get(headers=header, url=URL_ZILLOW)

soup = BeautifulSoup(data.text, "html.parser")
selector_for_prices = ".gMDnGj span"
prices = soup.select(selector_for_prices)
for price in prices:
    print(price.text)

With this I **only get 9 prices**, not the 40 or so prices shown on the webpage.

I've tried other functions such as `soup.find_all()`, but they don't work either. I've even tried Selenium. If I inspect the Zillow page in the browser and use the same selector as in my code, it works there, but not in my code. P.S. I changed the user agent in the code shown here, FYI.

The cards are rendered based on the scroll height. To get the rest of them, you will have to scroll all the way down. – Übermensch, Feb 07 '23 at 21:30
Also, is that the correct CSS selector? Aren't the prices in the `span` tag with `data-test="property-card-price"`? – Übermensch, Feb 07 '23 at 21:33
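
The two comments above can be sketched as a small helper. This is only a rough illustration, not a tested Zillow scraper: `scroll_until_stable` is a hypothetical name, it assumes the browser window itself scrolls (the result list may instead sit in its own scrollable container, in which case the script would have to target that element), and the page may still answer with a CAPTCHA before any cards load.

```python
import time

# Selector suggested in the comment above; Zillow's markup may have changed.
PRICE_SELECTOR = 'span[data-test="property-card-price"]'


def scroll_until_stable(driver, pause=1.0, max_rounds=30):
    """Scroll down until the page height stops growing.

    `driver` is any object exposing Selenium's `execute_script` method.
    Returns the number of scroll rounds performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily rendered cards time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no  # nothing new was rendered in this round
        last_height = new_height
    return max_rounds


# Possible usage with the imports from the question (not run here):
# driver = webdriver.Chrome()
# driver.get(URL_ZILLOW)
# scroll_until_stable(driver)
# prices = [el.text for el in driver.find_elements(By.CSS_SELECTOR, PRICE_SELECTOR)]
```

Stopping when the height no longer changes, rather than scrolling a fixed number of times, avoids guessing how many cards the page will load.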

Übermensch · Accepted Answer · 2023-02-08T17:18:52.187

1

Since the website has bot-detection capabilities, you will first need to find a way to avoid detection. This post contains a comprehensive list of methods to avoid detection.

It may also be worth looking into the APIs Zillow offers, as there does not seem to be a simple way to scrape their website. But if you're just doing this for fun or as a personal learning experience, then it's definitely worth taking some time to figure out the best approach to scraping Zillow.

edited Feb 08 '23 at 17:18

answered Feb 07 '23 at 21:44


How did you reach that conclusion? I read that the problem might be solved with a rotating residential proxy, because with Selenium, Zillow responds with a "press and hold" CAPTCHA. I appreciate the help. – Mr-Sepi0l Feb 08 '23 at 03:22
Just ran `driver.get(url)` a couple of times and I got the CAPTCHA test, so my answer won't work. A workaround like the one you mentioned might work, but I recommend first figuring out how the website is detecting bots. Knowing that will help you determine the best approach to avoid detection. – Übermensch Feb 08 '23 at 16:34
Also, it may be helpful to look at the requests the website makes. You may be able to get the data you want by making the appropriate requests yourself. This [article](https://scrapecrow.com/reverse-engineering-intro.html) goes over the basics of doing so in Chrome. – Übermensch Feb 08 '23 at 16:58
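
As a first step in that direction, note that the search URL in the question already carries its whole search state as URL-encoded JSON in the `searchQueryState` parameter. The sketch below only decodes and re-encodes that parameter with the standard library. The helper names are hypothetical, the sample URL is a shortened stand-in for the one in the question, and whether Zillow's own background requests accept the same structure is an assumption you would need to check in the browser's network tab.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def read_query_state(url):
    """Return the JSON object stored in the URL's searchQueryState parameter."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["searchQueryState"][0])


def with_query_state(url, state):
    """Return `url` with its query string replaced by the given search state.

    Any other query parameters on the original URL are dropped.
    """
    parts = urlsplit(url)
    encoded = urlencode(
        {"searchQueryState": json.dumps(state, separators=(",", ":"))}
    )
    return urlunsplit(parts._replace(query=encoded))


# Shortened stand-in for the URL in the question.
sample = (
    "https://www.zillow.com/homes/for_rent/?searchQueryState="
    "%7B%22pagination%22%3A%7B%7D%2C%22mapZoom%22%3A9%7D"
)

state = read_query_state(sample)
print(state)

state["mapZoom"] = 10  # tweak a field that is present in the original URL
print(with_query_state(sample, state))
```

Working with the decoded dictionary makes it easier to compare the URL against whatever request payloads show up in the network tab.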
