Web scraping remax.com with Python

Question

This is similar to a question I asked here, which was answered perfectly. Now that I have something to work with, I no longer want to enter the URL manually. Instead, I want to write a function that takes just the address and zip code and returns the data I want.

The problem is constructing the correct URL. For example:

url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'

Besides the address, state, and zip code, the URL also ends in a number, e.g. gid100012499996, which seems to be unique to each address. So I am not sure how to write the function I want.

Here is my code:

import urllib.request  # "import urllib" alone does not load the request submodule
from bs4 import BeautifulSoup
import pandas as pd

def get_data(url):
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
            'Accept-Encoding': 'none',
            'Accept-Language': 'en-US,en;q=0.8',
            'Connection': 'keep-alive'}
    request = urllib.request.Request(url, headers=hdr)
    html = urllib.request.urlopen(request).read()

    soup = BeautifulSoup(html,'html.parser')
    foot = soup.find('span', class_="listing-detail-sqft-val")
    print(foot.text.strip())

url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'
get_data(url)

What I want is something like the above, but with get_data() taking an address, state, and zip code instead of a URL. My apologies if this is not a suitable question for this site.

So you are looking for a way to generate the `gid` for a given address? — smac89, Feb 26 '19 at 22:55
I don't know how, but that site has a form and you can find out what the url of the post request is for the form, then use that url to get the data you need. — smac89, Feb 26 '19 at 23:18
When I look at the requests sent by the form, I see a url to an API and it looks like this when I typed only `Laguna Beach` into the location box: `https://www.remax.com/api/listings/?location=Laguna%20Beach,%20CA&Count=25&pagenumber=1&pageCount=10&tab=map&sh=true&maplistings=1&maplistcards=5&sv=true&sortorder=newest&view=forsale&&_=1551223014830`. Maybe you can use that — smac89, Feb 26 '19 at 23:21
This is the location of the form: `https://www.remax.com/realestatehomesforsale/ca-sitemap.html` — smac89, Feb 26 '19 at 23:22
@smac89 Could you provide an answer? I am pretty new to web scraping, so I am a bit lost on what you mean. — Wolfy, Feb 26 '19 at 23:23
What I mean is that it is not possible to obtain that gid by yourself because it seems to be auto-generated for each listing. So what you need to do is find an API which you can manipulate in some way to get the actual listing you are after. I showed you an example API used by one of the forms on that page, so you can explore that further. This is an example of the data I was able to pull from one of the APIs: https://codebeautify.org/jsonviewer/cb864e37 and this is the curl command that produced it: https://pastebin.com/MbqqEn9x — smac89, Feb 27 '19 at 00:13
Giving an actual solution to this question will take some digging and I don't have the time now. Open up your chrome debugger and start examining the API calls made on the pages until you find the right api that does this. You could also just take the curl request I posted and chop it into manageable parts so that you can make multiple requests with it. HTH. That is all I can do for now. Unfortunately one of the hazards of web-scraping is that you kinda have to dig deep to find the right path to a solution. The API is one possible path, but the HTML might have another story to tell — smac89, Feb 27 '19 at 00:15
How are you getting the list of properties that you want to get? i.e. how would you do this task manually? — Martin Evans, Feb 27 '19 at 08:20
@MartinEvans I would put in the address, city, state, zipcode into remax.com and look for lotsize. — Wolfy, Feb 27 '19 at 16:52
What is the URL of that form? Could you give an example for me to try? [this one](https://www.remax.com/realestatehomesforsale/ca-sitemap.html) doesn't have those fields. — Martin Evans, Feb 27 '19 at 19:05
@MartinEvans The url here https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html — Wolfy, Feb 27 '19 at 19:23
@MartinEvans Yes, correct, that's where the data I need exists — Wolfy, Feb 27 '19 at 19:26
I realise that, but the solution to your problem is usually to study the form that took you to that page rather than attempting to construct a URL to directly access it. That form page would probably give you the gid that you need. Namely the solution might need to be to search for the property. — Martin Evans, Feb 27 '19 at 19:30
@MartinEvans The way I got to that page I started at remax.com then clicked on home estimates and copy and pasted in the property address — Wolfy, Feb 27 '19 at 19:33
@MartinEvans My apologies if that is not helpful I find this particular site difficult to scrape data from. — Wolfy, Feb 27 '19 at 20:33
The site returns all properties matching a given search for a given map area. This return holds the URL (with gid) for all properties matching that given area. The difficulty would then be choosing a lat/long for a rectangle (nw & se corners) for which you want the list of properties returned. — Martin Evans, Feb 27 '19 at 21:17
@MartinEvans I see, would having latitude and longitude data help find the gid number? — Wolfy, Feb 27 '19 at 21:19
If you have an exact address, you would need to determine its lat/long, then make a rectangle containing it, submit that request and you will get JSON back containing the URL to use for the property. — Martin Evans, Feb 27 '19 at 21:20
@MartinEvans I can get the latitude and longitude from a MapQuest web scraper I made. Should I make a new question to get the URL for each property address? — Wolfy, Feb 27 '19 at 21:21
I'll add an answer explaining it. You could then start a separate question if needed - I'll leave the lat/long to you. — Martin Evans, Feb 27 '19 at 21:23

Martin Evans · Accepted Answer · 2019-02-28T09:23:33.467

The site has a JSON API that lets you get all of the details of properties in a given rectangle. The rectangle is given by latitude and longitude coordinates for the NW and SE corners. The following request shows a possible search:

import requests

params = {
    "nwlat" : 41.841966864112,          # Calculate from address
    "nwlong" : -74.08774571289064,      # Calculate from address
    "selat" : 41.64189784194883,        # Calculate from address
    "selong" : -73.61430363525392,      # Calculate from address
    "Count" : 100,
    "pagenumber" : 1,
    "SiteID" : "68000000",
    "pageCount" : "10",
    "tab" : "map",
    "sh" : "true",
    "forcelatlong" : "true",
    "maplistings" : "1",
    "maplistcards" : "0",
    "sv" : "true",
    "sortorder" : "newest",
    "view" : "forsale",
}

req_properties = requests.get("https://www.remax.com/api/listings", params=params)
matching_properties_json = req_properties.json()

for p in matching_properties_json[0]:
    print(f"{p['Address']:<40}  {p.get('BedRooms', 0)} beds | {int(p.get('BathRooms',0))} baths | {p['SqFt']} sqft")

This returns up to 100 results (a tighter rectangle would reduce the number). For example:

3 Pond Ridge Road                         2 beds | 3.0 baths | 2532 sqft
84 Hudson Avenue                          3 beds | 1.0 baths | 1824 sqft
116 HUDSON POINTE DR                      2 beds | 3.0 baths | 2455 sqft
6 Falcon Drive                            4 beds | 3.0 baths | 1993 sqft
53 MAPLE                                  5 beds | 2.0 baths | 3511 sqft
4 WOODLAND CIR                            3 beds | 2.0 baths | 1859 sqft
.
.
.
95 S HAMILTON ST                          3 beds | 1.0 baths | 2576 sqft
40 S Manheim Boulevard                    2 beds | 2.0 baths | 1470 sqft

Given an address, you would first need to calculate its latitude and longitude, then create a small rectangle around that point for the NW and SE corners and build the request with those numbers. You will then get a list of all properties in that area (hopefully just one).
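If the rectangle returns more than one property, the results still need to be narrowed down to the address you asked for. A minimal sketch of that step, assuming each entry is a dict with an 'Address' key as in the loop above; the helper names normalize and find_listing and the sample data are hypothetical, not part of the site's API:

```python
import re

def normalize(address):
    """Lower-case an address and collapse punctuation/whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", address.lower()).strip()

def find_listing(listings, address):
    """Return every listing whose 'Address' equals the wanted one after normalizing."""
    wanted = normalize(address)
    return [p for p in listings if normalize(p.get("Address", "")) == wanted]

# Hypothetical sample shaped like the entries printed above.
sample = [
    {"Address": "84 Hudson Avenue", "SqFt": 1824},
    {"Address": "116 HUDSON POINTE DR", "SqFt": 2455},
]

matches = find_listing(sample, "116 Hudson Pointe Dr.")
print(matches)
```

This only handles differences in case and punctuation. Abbreviations such as "Dr" versus "Drive" would need extra handling, so treat an empty result as "not matched" rather than "not listed".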

To make a search square, you could use something like:

lat = 41.841966864112
long = -74.08774571289064
square_size = 0.001

params = {
    "nwlat" : lat + square_size,
    "nwlong" : long - square_size,
    "selat" : lat - square_size,
    "selong" : long + square_size,
    "Count" : 100,
    "pagenumber" : 1,
    "SiteID" : "68000000",
    "pageCount" : "10",
    "tab" : "map",
    "sh" : "true",
    "forcelatlong" : "true",
    "maplistings" : "1",
    "maplistcards" : "0",
    "sv" : "true",
    "sortorder" : "newest",
    "view" : "forsale",
}

square_size would need to be adjusted depending on how accurate your address is.
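Since square_size is in degrees, it can be easier to think in metres. The sketch below converts a distance in metres into the four corner parameters; it assumes roughly 111,320 metres per degree of latitude and scales longitude by the cosine of the latitude. The function name bounding_box and the half_side_m parameter are my own, not part of the API:

```python
import math

METRES_PER_DEGREE_LAT = 111_320  # approximate; varies slightly with latitude

def bounding_box(lat, lon, half_side_m=100):
    """Return NW/SE corner parameters for a box centred on (lat, lon)."""
    dlat = half_side_m / METRES_PER_DEGREE_LAT
    # A degree of longitude shrinks by cos(latitude) away from the equator.
    dlon = half_side_m / (METRES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return {
        "nwlat": lat + dlat,
        "nwlong": lon - dlon,
        "selat": lat - dlat,
        "selong": lon + dlon,
    }

box = bounding_box(41.841966864112, -74.08774571289064, half_side_m=100)
print(box)
```

The returned dict can be merged into the other request parameters, e.g. params.update(box). The approximation is fine for small boxes away from the poles.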

Okay, so if I have columns for the properties such as Address, Latitude, and Longitude, how do I get the results you got? — Wolfy, Feb 27 '19 at 21:47
I tried your approach but I came across some errors, could you see this https://stackoverflow.com/questions/55342568/web-scraping-from-remax-com — Wolfy, Mar 25 '19 at 19:57
You can use your browser's network tools to monitor the traffic — Martin Evans, Mar 25 '19 at 21:59
I am unaware of how that works, have you been able to look at my post I attached above? — Wolfy, Mar 25 '19 at 22:02
Not had a chance to look tonight. On Firefox you can use Ctrl+Shift+E — Martin Evans, Mar 25 '19 at 22:18
