
I know that there are several related threads out there, and they have helped me a lot, but I still can't get all the way there. I am at the point where running the code doesn't result in errors, but I get nothing in my csv file. I have the following Scrapy spider that starts on one webpage, then follows a hyperlink, and scrapes the linked page:

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class bbrItem(Item):
    Year = Field()
    AppraisalDate = Field()
    PropertyValue = Field()
    LandValue = Field()
    Usage = Field()
    LandSize = Field()
    Address = Field()    

class spiderBBRTest(BaseSpider):
    name = 'spiderBBRTest'
    allowed_domains = ["http://boliga.dk"]
    start_urls = ['http://www.boliga.dk/bbr/resultater?sort=hus_nr_sort-a,etage-a,side-a&gade=Septembervej&hus_nr=29&ipostnr=2730']

    def parse2(self, response):        
        hxs = HtmlXPathSelector(response)
        bbrs2 = hxs.select("id('evaluationControl')/div[2]/div")
        bbrs = iter(bbrs2)
        next(bbrs)  # skip the first matched div before looping over the rest
        for bbr in bbrs:
            item = bbrItem()
            item['Year'] = bbr.select("table/tbody/tr[1]/td[2]/text()").extract()
            item['AppraisalDate'] = bbr.select("table/tbody/tr[2]/td[2]/text()").extract()
            item['PropertyValue'] = bbr.select("table/tbody/tr[3]/td[2]/text()").extract()
            item['LandValue'] = bbr.select("table/tbody/tr[4]/td[2]/text()").extract()
            item['Usage'] = bbr.select("table/tbody/tr[5]/td[2]/text()").extract()
            item['LandSize'] = bbr.select("table/tbody/tr[6]/td[2]/text()").extract()
            item['Address']  = response.meta['address']
            yield item

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        PartUrl = ''.join(hxs.select("id('searchresult')/tr/td[1]/a/@href").extract())
        url2 = ''.join(["http://www.boliga.dk", PartUrl])
        yield Request(url=url2, meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()}, callback=self.parse2)

I am trying to export the results to a csv file, but I get nothing in the file. Running the code, however, doesn't result in any errors. I know it's a simplified example with only one URL, but it illustrates my problem.

I think my problem could be that I am not telling Scrapy that I want to save the data in the `parse2` method.

BTW, I run the spider as `scrapy crawl spiderBBRTest -o scraped_data.csv -t csv`.

  • Does `parse2` get called? I can't see where from, if it does. There doesn't seem to be anything that tries to write out to a csv file either. – Steve Allison Jul 25 '13 at 16:02
  • @SteveAllison: Oops, that's a typo. I've changed it so that `parse2` is the callback of the request in `parse`, but it still doesn't work. – Mace Jul 25 '13 at 19:28

2 Answers


You need to modify your yielded `Request` in `parse` to use `parse2` as its callback.
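In other words, route the follow-up request to `parse2` explicitly, along these lines (`address` here just stands in for the link text you extract):

yield Request(url=url2, meta={'address': address}, callback=self.parse2)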

EDIT: `allowed_domains` shouldn't include the `http` prefix, e.g.:

allowed_domains = ["boliga.dk"]

Try that, rather than leaving `allowed_domains` blank, and see if your spider runs correctly.
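For context: Scrapy's offsite middleware matches `allowed_domains` entries against request hostnames, so an entry that includes a scheme never matches and follow-up requests are silently dropped before `parse2` ever runs. A minimal sketch of the corrected spider header, assuming the rest of the spider stays as posted:

class spiderBBRTest(BaseSpider):
    name = 'spiderBBRTest'
    # Hostname only: with "http://boliga.dk" the offsite middleware filters
    # every Request yielded from parse(), so parse2 is never called.
    allowed_domains = ["boliga.dk"]
    start_urls = ['http://www.boliga.dk/bbr/resultater?sort=hus_nr_sort-a,etage-a,side-a&gade=Septembervej&hus_nr=29&ipostnr=2730']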

  • I corrected the example (this was just a typo - I had it in my actual spider), and it still gives me an empty csv file. – Mace Jul 25 '13 at 19:31
  • So I'm testing your spider and it won't start due to a global `sites` variable not being declared. – Talvalin Jul 26 '13 at 08:28
  • Sorry, there were a few typos. Now it should work - I have tried it - even with `allowed_domains = ["boliga.dk"]` and I still don't get any data in the `csv` file (nor any errors). I believe my XPaths are correct, since I have checked them in XPath Checker. – Mace Jul 31 '13 at 08:12
  • I think my problem is that the `for` loop in `parse2` doesn't start. I thought `hxs2.select("id('evaluationControl')/div[2]/div") ` would return an `iterable`, because there are 4 matching nodes, but I don't know how to check the type. – Mace Jul 31 '13 at 08:22
  • Either leaving `allowed_domains` blank or removing the `http` prefix solves the problem. The other problems were only typos, unrelated to the topic of the question. Thanks for the answer! – Mace Jul 31 '13 at 09:36
  • Thanks @Mace - the suggestion of leaving the allowed domains blanked helped me! – Maverick Oct 17 '17 at 09:34

Try making the request with `dont_filter=True`:

yield Request(url=url2, meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()}, callback=self.parse2, dont_filter=True)
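If you're wondering why this helps: `dont_filter=True` exempts the request from Scrapy's request filtering, which includes the offsite check as well as the duplicate filter, so it most likely masks the `allowed_domains` problem described in the other answer rather than fixing it.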

  • If someone's wondering why: https://stackoverflow.com/questions/38951878/how-does-adding-dont-filter-true-argument-in-scrapy-request-make-my-parsing-meth – Mohammad Tbeishat Jul 30 '21 at 19:05