
My CrawlSpider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class FabulousFoxSpider(CrawlSpider):
    """Crawls show pages on fabulousfox.com."""
    name = "fabulousfox"
    allowed_domains = ["fabulousfox.com"]
    start_urls = ["http://www.fabulousfox.com"]
    rules = (
        Rule(SgmlLinkExtractor(
            allow=(r'/shows_page_(single|multi)\.aspx\?usID=(\d)*',),
            unique=True),
            'parse_fabulousfox',
        ),
    )

But when I run scrapy crawl fabulousfox -o data.json -t json, I get the following output:

...................
......................
2014-03-01 13:11:56+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-03-01 13:11:56+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-03-01 13:11:57+0530 [fabulousfox] DEBUG: Crawled (200) <GET http://www.fabulousfox.com> (referer: None)
2014-03-01 13:11:57+0530 [fabulousfox] DEBUG: Crawled (403) <GET http://www.fabulousfox.com/../shows_page_multi.aspx?usID=365> (referer: http://www.fabulousfox.com)
2014-03-01 13:11:58+0530 [fabulousfox] DEBUG: Crawled (403) <GET http://www.fabulousfox.com/../shows_page_single.aspx?usID=389> (referer: http://www.fabulousfox.com)
2014-03-01 13:11:58+0530 [fabulousfox] DEBUG: Crawled (403) <GET http://www.fabulousfox.com/../shows_page_multi.aspx?usID=388> (referer: http://www.fabulousfox.com)
2014-03-01 13:11:58+0530 [fabulousfox] DEBUG: Crawled (403) <GET http://www.fabulousfox.com/../shows_page_single.aspx?usID=394> (referer: http://www.fabulousfox.com)
2014-03-01 13:11:58+0530 [fabulousfox] DEBUG: Crawled (403) <GET http://www.fabulousfox.com/../shows_page_multi.aspx?usID=358> (referer: http://www.fabulousfox.com)
2014-03-01 13:11:58+0530 [fabulousfox] INFO: Closing spider (finished)
2014-03-01 13:11:58+0530 [fabulousfox] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 1660,
     'downloader/request_count': 6,
     'downloader/request_method_count/GET': 6,
     'downloader/response_bytes': 12840,
     'downloader/response_count': 6,
     'downloader/response_status_count/200': 1,
     'downloader/response_status_count/403': 5,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 3, 1, 7, 41, 58, 218296),
     'log_count/DEBUG': 8,
     'log_count/INFO': 7,
     'memdebug/gc_garbage_count': 0,
     'memdebug/live_refs/FabulousFoxSpider': 1,
     'memusage/max': 33275904,
     'memusage/startup': 33275904,
     'request_depth_max': 1,
     'response_received_count': 6,
     'scheduler/dequeued': 6,
     'scheduler/dequeued/memory': 6,
     'scheduler/enqueued': 6,
     'scheduler/enqueued/memory': 6,
     'start_time': datetime.datetime(2014, 3, 1, 7, 41, 56, 360266)}
2014-03-01 13:11:58+0530 [fabulousfox] INFO: Spider closed (finished)  

Why do the generated URLs contain .., as in

http://www.fabulousfox.com/../shows_page_multi.aspx?usID=365

Also, it's not generating all the URLs. What's wrong here?


2 Answers


Inspecting the HTML source of http://www.fabulousfox.com, you'll notice table rows like this:

<tr>
    <td width="7">
        <img src="images/home_shows_frame_left.jpg" width="7" height="128" />
    </td>
    <td width="155" height="128" align="center" valign="middle">
        <a id="Box4" href="../shows_page_single.aspx?usID=394"><img id="Image4" src="../images/ShowLogos/394.jpg" alt="Rickey Smiley's" style="border-width:0px;" /></a>
    </td>
    <td width="7" align="right">
        <img src="images/home_shows_frame_right.jpg" width="7" height="128" />
    </td>
</tr>

Although a browser understands these links and takes you to http://www.fabulousfox.com/shows_page_single.aspx?usID=394, Scrapy's SgmlLinkExtractor uses urlparse.urljoin() internally:

>>> import urlparse
>>> urlparse.urljoin('http://www.fabulousfox.com/', '../shows_page_single.aspx?usID=394')
'http://www.fabulousfox.com/../shows_page_single.aspx?usID=394'
>>> 

You could help the link extractor by providing a process_value callable,

SgmlLinkExtractor(process_value=lambda u: u.replace('../', '/'))

but a blind string replacement like this will probably not do what you want in all cases.
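
If you need something more robust, you could strip the leading .. segments instead of replacing every occurrence of '../' in the string. A minimal sketch (clean_dotdot is a hypothetical helper name, not part of Scrapy):

    def clean_dotdot(value):
        # Strip leading '../' segments and anchor the link at the site
        # root, so '../shows_page_single.aspx?usID=394' becomes
        # '/shows_page_single.aspx?usID=394'. Links without a leading
        # '../' are passed through unchanged.
        if not value.startswith('../'):
            return value
        while value.startswith('../'):
            value = value[3:]
        return '/' + value

    SgmlLinkExtractor(process_value=clean_dotdot, unique=True)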

– paul trmbrth

You aren't handling relative links correctly.

Use urlparse.urljoin to construct valid links.
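
For example, in a spider callback you could resolve every href against the URL of the page it was found on before requesting it. A minimal sketch (parse_page and the XPath are made up for illustration):

    import urlparse

    from scrapy.http import Request
    from scrapy.selector import Selector

    def parse_page(self, response):
        # response.url is absolute, so urljoin resolves each relative
        # href against the page it appeared on.
        for href in Selector(response).xpath('//a/@href').extract():
            yield Request(urlparse.urljoin(response.url, href))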

– Has QUIT--Anony-Mousse
  • I don't get why I need to do that. I wrote a crawler that does the same thing in an older version of Scrapy, and there weren't any issues there. – mrudult Mar 01 '14 at 12:40
  • I don't want to convert links from relative to absolute or vice versa. The problem is that not a single URL is getting crawled, not even the starting one. – mrudult Mar 01 '14 at 12:47
  • The starting one is crawled (code 200 = success), but all the others are incorrect and thus yield an error code. When you access an incorrect URL, you usually get an error. – Has QUIT--Anony-Mousse Mar 01 '14 at 13:51
  • Yeah, I get that. But why are the URLs not generated properly? I have written a few crawlers just like this one, and there the URLs were created properly; just not in this case. Why do I need to explicitly generate proper URLs, as you say in your answer? – mrudult Mar 01 '14 at 14:09
  • Probably because those sites didn't misuse `../something` relative URLs the way this one does. – Has QUIT--Anony-Mousse Mar 01 '14 at 15:25