
I recently finished a web scraping/automation program for Zillow for my boot camp. My instructor encouraged me to Google the problem, as I was only able to get the first couple of listings.

I stumbled upon this answer: Zillow web scraping using Selenium & BeautifulSoup

This worked well: instead of using bs4's find_all method, I was able to get all of my listings neatly laid out as JSON, which was much easier to go through to complete the project. I only recently learned about regex and Python's re module, and I was wondering if someone could explain how this code retrieves the nicely structured JSON from the GET response, and whether this would work for other websites?

Code was:

self.data = json.loads(re.search(r'!--(\{"queryState".*?)-->', self.response.text).group(1))
  1. What arguments does json.loads take here?
  2. How does the oddly written r'!--(\{"queryState".*?)-->' work?
  3. What is the purpose of .group(1)?

I hate just copying and pasting, but somehow this worked like magic and I'd like to know how to replicate it in future projects. Sorry if this is a loaded question, but the re.search documentation wasn't as helpful as I'd hoped.

Grismar
KevCo
  • https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean – Nick Sep 12 '22 at 01:19
  • https://docs.python.org/3/library/re.html#re.search – Nick Sep 12 '22 at 01:20
  • Thank you Nick for the search documentation, it explains that the first argument passed was the pattern and we used the get request for the string. – KevCo Sep 12 '22 at 01:26
  • Click the link on `match object` and it will also explain what `group(1)` is about... – Nick Sep 12 '22 at 01:27

1 Answer

  1. json.loads() needs only a single argument: a string that is parsed as JSON. The return value is typically a dictionary or list (depending on the JSON). Here, that single string is the return value of the call to .group(1).
  2. How is r'!--(\{"queryState".*?)-->' oddly written? It is a regular expression that is applied to self.response.text using re.search(). It looks for the literal !--, immediately followed by text starting with {"queryState", and ending at the next literal -->. The \ is there to indicate that the { is to be matched literally as well. The .*? means "any character, zero or more times, not greedily", so the match stops at the first --> instead of running on to the last one.
  3. .group(1) returns the text matched by the first capturing group in the regex, which is the first part in parentheses. In this case, that is everything between !-- and -->, provided it starts with {"queryState".
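
To make the three points above concrete, here is a minimal sketch that splits the one-liner into separate steps. The html string is a made-up stand-in for self.response.text, not real Zillow output.

```python
import json
import re

# Hypothetical stand-in for self.response.text (not real Zillow output).
html = 'before <!--{"queryState": {"page": 1}, "results": [1, 2]}--> after'

# Step 1: re.search() scans the string and returns a match object
# (or None when the pattern is not found anywhere).
match = re.search(r'!--(\{"queryState".*?)-->', html)

# Step 2: group(0) is the entire match including !-- and -->,
# group(1) is only the part captured by the parentheses.
whole = match.group(0)
payload = match.group(1)

# Step 3: json.loads() parses the captured JSON text into Python objects.
data = json.loads(payload)

print(whole)
print(payload)
print(data["results"])
```

Printing whole and payload side by side shows that the parentheses are what strip the surrounding !-- and --> before the text reaches json.loads().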

So, if self.response.text were this:

something
!--{"not queryState": 123}-->
something else
!--{"queryState": 123}-->
something else

Then running this:

self.data = json.loads(re.search(r'!--(\{"queryState".*?)-->', self.response.text).group(1))

Would set self.data to {'queryState': 123} (as json.loads() takes the string '{"queryState": 123}' and parses it into a dictionary, as user @Nick correctly pointed out)
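
One caveat with the one-liner: re.search() returns None when the pattern is not found, so calling .group(1) on the result raises an AttributeError. Below is a hedged sketch of a more defensive version; extract_json is a hypothetical helper name and page is invented sample text. It also shows what the non-greedy ? is doing.

```python
import json
import re

LAZY = r'!--(\{"queryState".*?)-->'    # stops at the first -->
GREEDY = r'!--(\{"queryState".*)-->'   # runs on to the last --> on the line


def extract_json(text, pattern=LAZY):
    """Parse the first capture group as JSON; return None if no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return json.loads(match.group(1))


# Invented sample with two comment blocks on one line.
page = 'a<!--{"queryState": 123}--><p>b</p><!--{"other": 1}-->c'

print(extract_json(page))  # {'queryState': 123}

# The greedy variant captures past the first --> and is no longer valid JSON.
greedy_payload = re.search(GREEDY, page).group(1)
print(greedy_payload)
try:
    json.loads(greedy_payload)
except json.JSONDecodeError as exc:
    print("greedy capture is not valid JSON:", exc)

# No embedded block at all: the helper returns None instead of raising.
print(extract_json("no embedded data here"))
```

Note that . does not match newlines by default, so this pattern assumes the JSON sits on a single line; passing re.DOTALL changes that. The same approach only carries over to another site if that site also embeds JSON in its HTML, and the pattern has to be adapted to whatever surrounds the JSON there.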

Grismar
  • Grismar, this was fantastic! Sorry, the "oddly written" was simply because I didn't understand what was going on. The dictionary was nested between a "!--" and a "-->" string in the GET response. The dictionary began with { "queryState": and I'm assuming the .* told it to grab everything in between. I apologize for my wording, I'm bad at figuring things out on my own but when given direction it begins to click. Thanks again! – KevCo Sep 12 '22 at 01:36