
I need to iterate over a form, filling it out with different options. I can already crawl/scrape data using Scrapy and Python for one set of variables, but I need to iterate through a list of them.

Currently, my spider can log in, fill out the form, and scrape the data.

To log in and complete the form I use:

from scrapy.http import FormRequest
from scrapy.spiders import CrawlSpider


class FormSpider(CrawlSpider):
    name = 'formSpider'
    allowed_domains = ['example.org']
    start_urls = ['https://www.example.org/en-en/']

    age = '35'
    days = '21'
    S1 = 'abc'
    S2 = 'cde'
    S3 = 'efg'
    S4 = 'hij'

    def parse(self, response):
        # from_response() already copies hidden inputs such as __VIEWSTATE
        # from the HTML form, but passing the token explicitly does no harm
        token = response.xpath('//*[@name="__VIEWSTATE"]/@value').extract_first()
        return FormRequest.from_response(response,
                                         formdata={'__VIEWSTATE': token,
                                                   'Password': 'XXXXX',
                                                   'UserName': 'XXXXX'},
                                         callback=self.scrape_main)

And I use this code to complete the form:

    def parse_transfer(self, response):
        return FormRequest.from_response(response,
                                         formdata={"Age": self.age,
                                                   "Days": self.days,
                                                   "Skill_1": self.S1,
                                                   "Skill_2": self.S2,
                                                   "Skill_3": self.S3,
                                                   "Skill_4": self.S4,
                                                   "butSearch": "Search"},
                                         callback=self.parse_item)

Then, I scrape the data and export it as CSV.
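
For reference, the scraping callback yields dicts that Scrapy's feed exporter writes out; a minimal sketch (the table id and field names here are placeholders, not from the real site):

    def parse_item(self, response):
        # placeholder selectors -- adapt to the real results page
        for row in response.xpath('//table[@id="results"]//tr'):
            yield {
                'name': row.xpath('td[1]/text()').extract_first(),
                'value': row.xpath('td[2]/text()').extract_first(),
            }

Running the spider with `scrapy crawl formSpider -o results.csv` then writes every yielded dict as a CSV row.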

What I need now is to iterate over the form inputs. I was thinking of using a list for each variable to change the form each time (I only need a certain number of combinations):

    age = ['35','36','37','38']
    days = ['10','20','30','40']
    S1 = ['abc','def','ghi','jkl']
    S2 = ['cde','qwe','rty','yui'] 
    S3 = ['efg','asd','dfg','ghj']
    S4 = ['hij','bgt','nhy','mju']

So I can iterate the form inputs like this:

age[0],days[0],S1[0],S2[0],S3[0],S4[0]... age[1],days[1]... and so on 
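
For illustration, Python's built-in zip() produces exactly that positional pairing (plain Python, independent of Scrapy):

    for combo in zip(age, days, S1, S2, S3, S4):
        print(combo)
    # ('35', '10', 'abc', 'cde', 'efg', 'hij')
    # ('36', '20', 'def', 'qwe', 'asd', 'bgt')
    # ...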

Any recommendation? I am open to different options (not only lists) to avoid creating multiple spiders.

UPDATE

This is the final code:

    def parse_transfer(self, response):
        for i in range(len(self.age)):
            yield FormRequest.from_response(response,
                                            formdata={"Age": self.age[i],
                                                      "Days": self.days[i],
                                                      "Skill_1": self.S1[i],
                                                      "Skill_2": self.S2[i],
                                                      "Skill_3": self.S3[i],
                                                      "Skill_4": self.S4[i],
                                                      "butSearch": "Search"},
                                            dont_filter=True,
                                            callback=self.parse_item)

    def parse_item(self, response):
        # needs: from scrapy.utils.response import open_in_browser
        open_in_browser(response)
        # it opens all the websites after submitting the form :)

1 Answer


It's hard to understand what your current parse_transfer() is meant to be doing, because your FormSpider doesn't have a self.skill_1 that we can see. Also, you may not need to inherit from CrawlSpider here, and you should change the returns to yields.

To iterate on the form, I recommend replacing the spider attributes you currently have with the lists you will use for iteration.

Then loop in parse_transfer():

def parse_transfer(self, response):
    for i in range(len(self.age)):
        yield FormRequest.from_response(response,
                                        formdata={"Age": self.age[i],
                                                  "Days": self.days[i],
                                                  "Skill_1": self.S1[i],
                                                  "Skill_2": self.S2[i],
                                                  "Skill_3": self.S3[i],
                                                  "Skill_4": self.S4[i],
                                                  "butSearch": "Search"},
                                        callback=self.parse_item)

This may not be a viable solution based on the way the website accepts requests, though.
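
If the site does accept the requests, one way to remember which combination produced each response is to pass the inputs along via Scrapy's standard meta dict (a sketch, not part of the original answer; the formdata is trimmed for brevity):

    def parse_transfer(self, response):
        for i in range(len(self.age)):
            yield FormRequest.from_response(
                response,
                formdata={"Age": self.age[i], "Days": self.days[i]},  # add the skills as above
                meta={"age": self.age[i], "days": self.days[i]},      # carried over to the response
                callback=self.parse_item)

    def parse_item(self, response):
        # response.meta lets each exported row be tied back to its form inputs
        age = response.meta["age"]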

  • You are right. The skill_1 shouldn't be there (edited). I will try your option. Thanks. – crianopa Jul 02 '19 at 23:37
  • Super... The suggestions work perfectly. Since I am starting with Scrapy, can you explain to me how **return** and **yield** work in this case? – crianopa Jul 03 '19 at 10:55
  • Ok, reviewing the results... The iteration works, but only the last iteration passes the request to parse_item(self, response). Any suggestion? – crianopa Jul 04 '19 at 03:51
  • All right! I had to add dont_filter=True to the FormRequest(). – crianopa Jul 04 '19 at 05:11
  • Yes, dont_filter is necessary because the BaseDupeFilter is seeing the request fingerprints as identical. And yield is not unique to Scrapy; it is a Python keyword that is important to know. https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do – pwinz Jul 04 '19 at 13:07
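
For illustration of that last point (plain Python, no Scrapy involved): return hands back a single value and ends the call, while yield turns the function into a generator that can produce many values, which is why the loop above can emit one FormRequest per combination:

    def single():
        return 1              # one value, then the function is done

    def several():
        for i in range(3):
            yield i           # produces 0, 1, 2 lazily, one per iteration

    print(single())           # 1
    print(list(several()))    # [0, 1, 2]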