
I am trying to scrape https://a836-propertyportal.nyc.gov/Default.aspx with Scrapy. I am having difficulty using FormRequest: specifically, I do not know how to tell Scrapy to fill out the block and lot form fields and then get the response page. I tried following the FormRequest example in the Scrapy documentation (http://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login), but I still could not get the "Search" button click to work.

I would really appreciate it if you could offer any suggestions so that I can extract data from the submitted page. Some poster on SO suggested that Scrapy cannot handle JS events well, and to use another library like CasperJS instead.

Update: I would very much appreciate it if someone could point me to a Java/Python/JS library that lets me submit a form and retrieve the resulting page.

Updated code (following Pawel's comment):

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import FormRequest, Request


class MonshtarSpider(Spider):
    name = "monshtar"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["a836-propertyportal.nyc.gov"]
    start_urls = (
        'https://a836-propertyportal.nyc.gov/Default.aspx/',
    )

    def parse(self, response):
        print("entered the parsing section!!")

        # GET the details page, passing the search values as cookies
        yield Request(
            "https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx",
            cookies={"borough": "1", "block": "01000", "style": "default", "lot": "0011"},
            callback=self.aftersubmit,
        )

    def aftersubmit(self, response):
        # get the data....
        print("SUCCESS!!\n\n\n")
NumenorForLife

2 Answers


Your page is somewhat bizarre and difficult to parse. After a valid POST request is submitted, the page responds with a 302 HTTP status and a bunch of cookies. (Your formdata is invalid, by the way: you need to replace the underscores with dollar signs in your parameter names.)

Content can be viewed after sending GET to https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx

The most surprising thing is that you can crawl this site using only cookies, without a POST request. The POST is there only to give you cookies; it does not redirect to, or respond with, an HTML page. You can set those cookies yourself from your spider. You only need to make a first GET to obtain the session cookie, and then successive GETs with the borough, block, etc. cookies.

Try this in scrapy shell:

pawel@stackoverflow:~/stack/scrapy$ scrapy shell "https://a836-propertyportal.nyc.gov/Default.aspx"

In [1]: from scrapy.http import Request

In [2]: req = Request("https://a836-propertyportal.nyc.gov/ExemptionDetails.aspx", cookies = {"borough":"1", "block":"01000", "style":"default", "lot":"0011"})

In [3]: fetch(req)

In [4]: view(response)

Out[4]: True # opens a browser window

The response at this point will contain the data for the property with the given borough, block and lot. Now you only need to use this knowledge in your spider: replace your POST with a GET that carries the cookies, attach a callback to the request you built in the shell, and it should work fine.
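For readers who want to try the same two-step flow outside Scrapy, here is a minimal sketch using only the Python standard library (this swaps in urllib for Scrapy; it is not part of the original approach). The helper names bbl_cookies, cookie_header and fetch_details are hypothetical, and the zero-padding of block and lot is an assumption inferred from the example values "01000" and "0011". Whether the site still behaves this way is not something the sketch can confirm.

```python
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "https://a836-propertyportal.nyc.gov"


def bbl_cookies(borough, block, lot):
    """Build the cookie values used in the shell session above.

    Assumption: block is zero-padded to 5 digits and lot to 4 digits,
    inferred only from the example values "01000" and "0011".
    """
    return {
        "borough": str(borough),
        "block": str(block).zfill(5),
        "style": "default",
        "lot": str(lot).zfill(4),
    }


def cookie_header(cookies):
    """Serialize a dict into a Cookie request-header value."""
    return "; ".join("{}={}".format(k, v) for k, v in cookies.items())


def fetch_details(borough, block, lot):
    """GET Default.aspx for the session cookie, then GET the details page."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.open(BASE_URL + "/Default.aspx").close()  # fills the jar

    # Merge the session cookies with the search cookies by hand: the cookie
    # processor does not add jar cookies when a Cookie header is already set.
    cookies = {c.name: c.value for c in jar}
    cookies.update(bbl_cookies(borough, block, lot))

    request = urllib.request.Request(
        BASE_URL + "/ExemptionDetails.aspx",
        headers={"Cookie": cookie_header(cookies)},
    )
    with opener.open(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_details(1, 1000, 11) would request the same property as the shell session; nothing touches the network until that function is called.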

If this still does not work, or is somehow unsuited to your purposes, try extracting the hidden AJAX parameter (the value of ctl00_ScriptManager1_HiddenField) and adding it to your formdata (and, of course, correct your formdata so that it is identical to what the browser sends).
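If you do go the POST route, the hidden values change with each page load, so they have to be read from the form page rather than hard-coded. The following is a rough sketch, using only the standard library, of collecting every hidden input and merging in the search fields. HiddenFieldParser and build_formdata are hypothetical names; the ctl00$SampleContent$ctl01$... field names and the "Please Wait..." button value are taken from the request data posted in the other answer and may not match what the site expects today.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            # A missing value attribute is treated as an empty string
            self.fields[attr["name"]] = attr.get("value") or ""


def build_formdata(html, borough, block, lot):
    """Merge the page's hidden fields with the block/lot search fields."""
    parser = HiddenFieldParser()
    parser.feed(html)
    formdata = dict(parser.fields)

    # Field names use dollar signs, as in the browser's request
    prefix = "ctl00$SampleContent$ctl01$"
    formdata.update({
        prefix + "ddlParclBorough": str(borough),
        prefix + "txtBlock": str(block),
        prefix + "txtLot": str(lot),
        prefix + "btnSearchBBL": "Please Wait...",
    })
    return formdata
```

The resulting dict can be passed as the formdata argument of a Scrapy FormRequest; Scrapy's FormRequest.from_response does a similar collection of existing form fields for you.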

Pawel Miech
  • Hi @Pawel: I am now able to print out the response, thanks so much! However, not all the values of interest appear. For example, the Tax Rate does appear when I print out the body of the response, but the Owner Name does not. How would you recommend I remedy this? – NumenorForLife May 25 '14 at 23:20

You don't click the search button; instead, you make a POST request to the page with all the form data. Checking the page, though, it sends a lot of data. Below are the parameters from my request:

ctl00_ScriptManager1_HiddenField:;;AjaxControlToolkit, Version=3.0.11119.25904, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:f48478dd-9360-4d50-94c1-5c5fa55bd379:865923e8:411fea1c:e7c87f07:91bd373d:1d58b08c:8e72a662:acd642d2:596d588c:77c58d20:14b56adc:269a19ae:bbfda34c:30a78ec5:5430d994
__EVENTTARGET:
__EVENTARGUMENT:
__VIEWSTATE:/wEPDwULLTEwMDA4NDY4ODAPZBYCZg9kFgICBQ9kFgQCAg9kFgQCAQ8WAh4HVmlzaWJsZWhkAgcPFgIfAGgWAgIBDxYCHglpbm5lcmh0bWwFGEFsZXJ0IGZvcjxiciAvPiBCQkwgOiAtLWQCBA9kFgQCAg9kFgQCAQ9kFgRmDw8WBB4IQ3NzQ2xhc3MFF2FjY29yZGlvbkhlYWRlclNlbGVjdGVkHgRfIVNCAgJkZAIBDw8WBB8CBRBhY2NvcmRpb25Db250ZW50HwMCAhYCHgVzdHlsZQUOZGlzcGxheTpibG9jaztkAgIPZBYEZg8PFgQfAgUPYWNjb3JkaW9uSGVhZGVyHwMCAmRkAgEPDxYEHwIFEGFjY29yZGlvbkNvbnRlbnQfAwICFgIfBAUNZGlzcGxheTpub25lOxYCAgEPZBYCZg9kFgZmDw9kFgIfBAUNZGlzcGxheTpub25lO2QCDA8PFgIfAGhkZAINDw8WAh8AaGRkAgMPD2QWBh4FU3R5bGUFN3dpZHRoOjM1MHB4O2JhY2tncm91bmQ6d2hpdGU7ZGlzcGxheTpub25lO29wYWNpdHk6MC45MjseC29ubW91c2VvdmVyBQ93d2hIZWxwLnNob3coKTseCm9ubW91c2VvdXQFD3d3aEhlbHAuaGlkZSgpO2Rky2sFuMlw1iy/E0GN9cB65RXg7Aw=
__EVENTVALIDATION:/wEWGgKWm9a2BgL687aTAwLmha0BAujn2IECAo3DtaEJAtLdz/kGAr3g5K4DAu78ttcEAvOB3+MGAvKB3+MGAvGB3+MGAvCB3+MGAveB3+MGAoHAg44PArT/mOoPAqrvlMAJAtzQstcEAoDswboFAoHswboFAoLswboFAoPswboFAoTswboFAtjqpO8KAujQ7b0GAqvgnb0NAsPa/KsBQz19YIqBRvCWvZh8bk6XKxp+wQo=
grpStyle:blue
ctl00$SampleContent$MyAccordion_AccordionExtender_ClientState:0
ctl00$SampleContent$ctl01$TextBox1:(unable to decode value)
ctl00$SampleContent$ctl01$ddlParclBorough:1
ctl00$SampleContent$ctl01$txtBlock:100
ctl00$SampleContent$ctl01$txtLot:200
ctl00$SampleContent$ctl01$btnSearchBBL:Please Wait...
ctl00$SampleContent$ctl03$TextBox2:(unable to decode value)
ctl00$SampleContent$ctl03$ddlParclBoroughPropAddr:1
ctl00$SampleContent$ctl03$txtHouseNbr:
ctl00$SampleContent$ctl03$txtStreetNm:
ctl00$SampleContent$ctl03$txtAptNbr:

My suggestion is to use a scraping library that supports executing JS, or to use something else. I have had a lot of success using Selenium and WebDriver to execute code in a browser, which supports JS.

Update:

There is an example here: How to submit a form using PhantomJS.

FrEaKmAn
  • What scraping libraries would you recommend I use then? Do you use Scrapy + Selenium/WebDriver? – NumenorForLife May 25 '14 at 16:42
  • I'm using scrapy for non javascript websites and selenium when javascript is involved. afaik selenium was not built for scraping, but it works. Make sure you check around, I'm pretty sure there are libraries/projects for scraping which support JS. – FrEaKmAn May 25 '14 at 16:49
  • @eddwinpaz: BeautifulSoup won't be of any use. It just parses HTML. – Blender May 25 '14 at 21:29
  • Yes, you cannot do that with BeautifulSoup. @jsc123 check my update and link to PhantomJS. – FrEaKmAn May 25 '14 at 23:31