
This is my first post so I apologize if it is a duplicate but I could not find an answer relevant to mine. If there is one please let me know and I will check it out.

I am attempting to scrape a website (below) to find Berkeley rent ceilings. The trouble I'm having is that I cannot figure out how to insert an address into the search box and scrape the info from the next page. In the past, the URLs I've worked with changed with the search input, but that is not the case on this website. I thought my best bet would be using bs4 to scrape the info, and requests.Session with requests.post to get to each subsequent address.

# Berkeley Rent Scrape
from bs4 import BeautifulSoup
import sys
import requests
import openpyxl
import pprint
import csv

#wb = openpyxl.load_workbook('workbook.xlsx', data_only=True)
#sheet = wb.get_sheet_by_name('worksheet')


props_payload = {'aspnetForm': '1150 Oxford St'}
URL = 'http://www.ci.berkeley.ca.us/RentBoardUnitSearch.aspx'

s = requests.Session()
p = s.post(URL, data=props_payload)
soup = BeautifulSoup(p.text, 'html.parser')
# "class" is a reserved word in Python, so bs4 takes class_ instead
data = soup.find_all('td', class_='gridItem')

UPDATE: How do you get the info from the new webpage once the POST has been sent? In other words, what is the framework for using requests.post followed by requests.get or a bs4 scrape when the URL does not change?

I was thinking it would look something like the code above. I'm sure I need a GET request somewhere in there, but I don't understand how sessions work when the URL doesn't change.

I will be exporting the info to a CSV file and to an Excel sheet, but I can deal with that later. I just want to get the main part out of the way first.

Thank you for any help!

  • 1
    i dont see any actual question here... whats wrong with the solution you posted? – Joran Beasley Jan 11 '17 at 21:46
  • Thanks, I got a little caught up in the explanation – S_Stand_ring Jan 11 '17 at 21:52
  • 1
    This question is too broad. Stack Overflow isn't a place where you can ask other people to tutorialize or write code for you, but instead a place where you can ask specific questions when you need help or guidance. In this case, you're effectively asking someone to tell you how to write this code for you. Based on your code sample, it appears `data` might contain what you need... does it not? Please be specific. – garrettmurray Jan 11 '17 at 22:03
  • My intention was not to bring a broad question, nor to ask someone to write it for me, so I apologize @garrettmurray that I posted a "garbage" post, as I dislike those types of posts also. I suppose what I am looking for is: What is the framework for using a `requests.post` then a `requests.get` or `bs4` scrape when the URL does not change? – S_Stand_ring Jan 11 '17 at 22:16

1 Answer


As you can see from the link, the search does not work through a redirect, so you can't pass your query in the URL. I'm not sure how well you can work directly with the ASP.NET backend (it may be tricky because of authentication/validation on the server side). I think a browser automation (testing) tool such as PhantomJS and/or CasperJS can help. It gives you control over the rendered web page, so you can programmatically type the query into the input and grab the data once the response comes back.
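
If you do want to try talking to the ASP.NET backend directly, here is a rough sketch of the usual postback pattern: GET the page first, copy every hidden input (ASP.NET WebForms pages typically carry state in fields such as `__VIEWSTATE` and `__EVENTVALIDATION`), add your own form values, and POST everything back to the same URL within one session. The response to that POST is the results page, so no second GET is needed. This is untested against the site in question. The field names `StreetNumber` and `StreetName` are hypothetical placeholders and must be replaced with the real `name` attributes from the page's `<input>` tags, and the form may also expect the submit button's name/value in the payload. Hidden fields are collected with the standard library's `html.parser` here only to keep the sketch dependency-light; bs4 would do the same job.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_payload(page_html, form_fields):
    """Merge the page's hidden state fields with the values we want to submit."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    payload = dict(parser.fields)
    payload.update(form_fields)
    return payload


def search(session, url, form_fields):
    """GET the form, then POST back to the same URL; return the result HTML."""
    form_page = session.get(url)
    payload = build_postback_payload(form_page.text, form_fields)
    result = session.post(url, data=payload)
    return result.text


def run_live_search():
    """Live request against the site from the question (not called automatically)."""
    import requests  # third-party; only needed for the real request

    url = "http://www.ci.berkeley.ca.us/RentBoardUnitSearch.aspx"
    # HYPOTHETICAL field names: read the real ones from the page source.
    fields = {"StreetNumber": "1150", "StreetName": "Oxford"}
    with requests.Session() as s:
        return search(s, url, fields)
```

The string returned by `search` can then be handed to `BeautifulSoup(html, 'html.parser')` and queried with `find_all('td', class_='gridItem')` as in the question. If the server rejects the POST or just returns the empty form again, a headless browser is the more reliable route.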

  • PhantomJS is a standalone application (a headless browser) and CasperJS is just a JS wrapper around it. You can use PhantomJS with Python. Example: http://stackoverflow.com/questions/13287490/is-there-a-way-to-use-phantomjs-in-python – Сергей Жильцов Jan 11 '17 at 22:21