
I have the following spider that's pretty much just supposed to POST to a form. I can't seem to get it to work, though: the response never shows up when I do it through Scrapy. Could someone tell me where I'm going wrong?

Here's my spider code:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.http import FormRequest
from scrapy.shell import inspect_response


class RajasthanSpider(scrapy.Spider):
    name = "rajasthan"
    allowed_domains = ["rajtax.gov.in"]
    start_urls = (
        'http://www.rajtax.gov.in/',
    )

    def parse(self, response):
        return FormRequest.from_response(
            response,
            formname='rightMenuForm',
            formdata={'dispatch': 'dealerSearch'},
            callback=self.dealer_search_page)

    def dealer_search_page(self, response):

        yield FormRequest.from_response(
            response,
            formname='dealerSearchForm',
            formdata={
                "zone": "select",
                "dealertype": "VAT",
                "dealerSearchBy": "dealername",
                "name": "ana"
            }, callback=self.process)

    def process(self, response):
        inspect_response(response, self)

What I get is a response like this: No result Found

What I should be getting is a result like this: Results Found

When I replace my dealer_search_page() with a Splash-enabled version, like this:

def dealer_search_page(self, response):

    yield FormRequest.from_response(
        response,
        formname='dealerSearchForm',
        formdata={
            "zone": "select",
            "dealertype": "VAT",
            "dealerSearchBy": "dealername",
            "name": "ana"
        },
        callback=self.process,
        meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            }
        })

I get the following warning:

2016-03-14 15:01:29 [scrapy] WARNING: Currently only GET requests are supported by SplashMiddleware; <POST http://rajtax.gov.in:80/vatweb/dealerSearch.do> will be handled without Splash

And the program exits before it reaches my inspect_response() in my process() function.

The error says that Splash doesn't support POST yet. Will Splash work for this use case, or should I be using Selenium?

BoreBoar
  • I do see the `inspect_response` work and the shell opened. What is happening on your end? – alecxe Mar 13 '16 at 11:20
  • I added a screen shot from my `view(response)`, and then the screen shot of what actually should be happening once the search button is clicked. Do you see the response like the second screen shot at your end? All i see is the first. – BoreBoar Mar 13 '16 at 11:41

2 Answers


Splash now supports POST requests. Try SplashFormRequest or {'splash': {'http_method': 'POST'}}.

Based on https://github.com/scrapy-plugins/scrapy-splash
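For instance, a minimal sketch of the SplashFormRequest route, assuming scrapy-splash is installed and its downloader middlewares plus SPLASH_URL are configured in settings.py (the callback mirrors the dealer_search_page() from the question):

from scrapy_splash import SplashFormRequest

def dealer_search_page(self, response):
    # Same form data as in the question, but sent as a Splash-aware request
    # so the POST is handled through the rendering service.
    yield SplashFormRequest.from_response(
        response,
        formname='dealerSearchForm',
        formdata={
            "zone": "select",
            "dealertype": "VAT",
            "dealerSearchBy": "dealername",
            "name": "ana"
        },
        args={'wait': 0.5},  # forwarded to the Splash endpoint
        callback=self.process)

The meta-based route is to keep the plain FormRequest from the question and add 'http_method': 'POST' to its 'splash' dict.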

Aminah Nuraini

You can approach it with Selenium. Here is a complete working example where we submit the form with the same search parameters as in your Scrapy code and print the results to the console:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get("http://www.rajtax.gov.in/")

# accept the alert
driver.switch_to.alert.accept()

# open "Search for Dealers"
wait = WebDriverWait(driver, 10)
search_for_dealers = wait.until(EC.visibility_of_element_located((By.PARTIAL_LINK_TEXT, "Search for Dealers")))
search_for_dealers.click()

# set search parameters
dealer_type = Select(driver.find_element_by_name("dealertype"))
dealer_type.select_by_visible_text("VAT")

search_by = Select(driver.find_element_by_name("dealerSearchBy"))
search_by.select_by_visible_text("Dealer Name")

search_criteria = driver.find_element_by_name("name")
search_criteria.send_keys("ana")

# search
driver.find_element_by_css_selector("table.vattabl input.submit").click()

# wait for and print results
table = wait.until(EC.visibility_of_element_located((By.XPATH, "//table[@class='pagebody']/following-sibling::table")))

for row in table.find_elements_by_css_selector("tr")[1:]:  # skipping header row
    print(row.find_elements_by_tag_name("td")[1].text)

Prints the TIN numbers from the search results table:

08502557052
08451314461
...
08734200736

Note that the browser you automate with Selenium can be headless: PhantomJS, or a regular browser on a virtual display.
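
A rough sketch of both options (assuming the phantomjs binary is on PATH for the first, and that the pyvirtualdisplay package and Xvfb are installed for the second):

# Option 1: PhantomJS, a headless WebKit browser.
driver = webdriver.PhantomJS()

# Option 2: a regular Firefox running on a virtual display.
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1024, 768))
display.start()
driver = webdriver.Firefox()

# ... the scraping code above runs unchanged ...

driver.quit()
display.stop()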


Answering the initial question (before the edit):

What I see on the Dealer Search page: the form and its fields are constructed by a bunch of JavaScript executed in the browser. Scrapy cannot execute JavaScript, so you need to help it with that part. I am pretty sure Scrapy + Splash would be enough in this case and you would not need to go into full browser automation; the core of a working Scrapy + Splash setup is sketched below.
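
A minimal sketch of that pattern, assuming a running Splash instance with scrapy-splash's SPLASH_URL and downloader middlewares configured in settings.py (the spider and callback names here are illustrative):

import scrapy


class DealerSearchSplashSpider(scrapy.Spider):
    name = "rajasthan_splash"
    start_urls = ['http://www.rajtax.gov.in/']

    def parse(self, response):
        # Re-request the page through Splash's render.html endpoint so the
        # JavaScript that builds the dealer search form actually runs.
        yield scrapy.Request(
            response.url,
            callback=self.parse_rendered,
            meta={'splash': {'endpoint': 'render.html',
                             'args': {'wait': 0.5}}},
            dont_filter=True)

    def parse_rendered(self, response):
        # response.body is now the browser-rendered HTML, form included.
        self.logger.info("Rendered page size: %d bytes", len(response.body))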

alecxe
  • Okay, i'm kind of lost again. I've been using `FormRequest.from_response()` to get all the cookies and the session id's and the other form fields. How do i make a `Request` to my `process` function with all the form data parameters that i got through `FormRequest.from_response`? Since i can only use `meta` with `Requests` and not `FormRequests`? :s – BoreBoar Mar 13 '16 at 16:02
  • @MetalloyD `FormRequest` is a subclass of the regular `Request` - you should be able to use the same `meta` values as with a regular `Request`. At the moment, I'm having a hard time setting up the splash container locally and not able to reproduce the issue. What code do you currently have (post it in the question or into a gist)? Thanks – alecxe Mar 14 '16 at 03:00
  • Okay, I've changed the code to what i've tried. I've also changed the question to make it more specific about Splash. – BoreBoar Mar 14 '16 at 09:40
  • @MetalloyD okay, check out the updated answer. Hope that helps. – alecxe Mar 14 '16 at 14:25
  • Thanks a lot alecxe! :D So, if i was to understand though, this type of use case can't be used with Splash? Can Scrapy be used along with Selenium for these kind of scrapes? Cuz Scrapy's really fast! And Selenium is pretty slow in comparison. – BoreBoar Mar 14 '16 at 16:43
  • I also have this other question where my results don't come out fine. Could you tell me what i'm doing wrong here as well. Thanks a lot though! http://stackoverflow.com/questions/35976080/scrapy-inconsistent-output – BoreBoar Mar 14 '16 at 16:45