5

The website that I am crawling contains many players, and when I click on any player, I go to his page.

The website structure is like this:

<main page>
<link to player 1>
<link to player 2>
<link to player 3>
..
..
..
<link to player n>
</main page>

And when I click on any link, I go to the player's page, which is like this:

<player name>
<player team>
<player age>
<player salary>
<player date>

I want to scrape all the players whose age is between 20 and 25 years.

What I am doing

  1. Scrape the main page using the first spider.

  2. Get the links using the first spider.

  3. Crawl each link using the second spider.

  4. Get the player information using the second spider.

  5. Save this information to a JSON file using a pipeline (see the sketch after this list).
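For step 5, the pipeline is roughly like the following sketch (the class name and file name are placeholders, not my real code):

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('players.json', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # write each item as one JSON line
        self.file.write(json.dumps(dict(item)) + '\n')
        return item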

My question

How can I return the date value from the second spider to the first spider?

What I have tried

I built my own middleware and overrode process_spider_output. It allows me to print the requests, but I don't know what else I should do in order to return that date value to my first spider.
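Roughly, the middleware looks like this (simplified; the class name is a placeholder):

class DateMiddleware(object):

    def process_spider_output(self, response, result, spider):
        # 'result' is everything the spider callback yielded
        for request_or_item in result:
            print(request_or_item)  # this is where I can print the requests
            yield request_or_item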

Any help is appreciated.

Edit

Here is some of the code:

def parse(self, response):
    sel = Selector(response)
    Container = sel.css('div[MyDiv]')
    for player in Container:
        # extract LINK and TITLE
        yield Request(LINK, meta={'Title': TITLE}, callback=self.parsePlayer)

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE
    return player

I gave you the general code, not the specific details, in order to make it easier to read.

Marco Dinatsoli
  • By *spider* you mean *callback*? Can you show a bit of your spider code? – paul trmbrth Feb 07 '14 at 14:58
  • @pault. OK, I will show you. I will post the code, but I won't be available for about 2 hours because the laptop battery will be empty in 20 minutes and I won't reach home for 2 hours, so please forgive me if I am late – Marco Dinatsoli Feb 07 '14 at 15:12
  • @pault. I am back and I edited the question. – Marco Dinatsoli Feb 07 '14 at 17:12
  • I don't think you'll be able to pass content from the 2nd callback back to the 1st callback. But you could pass data from the 1st callback to the 2nd callback as you do for the `Title` field. Why do you need to *feed data back to the 1st callback*? – paul trmbrth Feb 07 '14 at 17:20
  • It is indeed unclear why you need to pass the `date` from the second function to the first one (one spider, two functions). If you need to save the date through the pipeline, why don't you save it to the player item? – Robin Feb 07 '14 at 18:36
  • @pault. I need to return the date from the second callback to the first callback because I want to stop crawling when I find the first player whose age is not between 20 and 25 years. Also, I can't pass the date from the first callback to the second one because the first callback scrapes the page that doesn't have the date item; the second callback scrapes the page that has the date item. – Marco Dinatsoli Feb 07 '14 at 20:37
  • @Robin please read the above comment – Marco Dinatsoli Feb 07 '14 at 20:38
  • @MarcoDinatsoli: read it. You should update your question with that comment to make it clearer for everyone landing on the page. – Robin Feb 07 '14 at 23:43
  • @MarcoDinatsoli, see [this](http://stackoverflow.com/questions/9334522/scrapy-follow-link-to-get-additional-item-data/9340447#9340447) answer – warvariuc Feb 08 '14 at 04:36
  • @warwaruk this is not my case at all. My case is that the date is on the details page, not on the master page. – Marco Dinatsoli Feb 08 '14 at 05:32
  • @pault. check my answer below please – Marco Dinatsoli Feb 08 '14 at 19:26

3 Answers

4

You want to discard players outside a range of dates

All you need to do is check the date in parsePlayer, and return only the relevant ones.

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE
    if DATE == some_criteria:
        yield player
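Here `some_criteria` stands for your 20-25 years check. A possible sketch, assuming DATE is a birth-date string in YYYY-MM-DD format (the real format on the site may differ):

from datetime import datetime

def age_in_range(date_string, low=20, high=25):
    born = datetime.strptime(date_string, '%Y-%m-%d')
    age = (datetime.now() - born).days // 365  # approximate age in years
    return low <= age <= high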

You want to scrape every link in order and stop when some date is reached

For example, if you have performance issues (you are scraping way too many links and you don't need the ones after some limit).

Given that Scrapy works with asynchronous requests, there is no really good way to do that. The only option you have is to force linear behavior instead of the default parallel requests.

Let me explain. With two callbacks like that, by default Scrapy will first parse the first page (the main page) and put all the requests for the player pages in its queue. Without waiting for that first page to finish being scraped, it will start processing these requests for player pages (not necessarily in the order it found them).

Therefore, when you get the information that player page p is out of date, it has already sent internal requests for p+1, p+2...p+m (m is basically a random number) AND has probably started processing some of these requests. Possibly even p+1 before p (no fixed order, remember).

So there is no way to stop exactly at the right page if you keep this pattern, and no way to interact with parse from parsePlayer.

What you can do is force it to follow the links in order, so that you have full control. The drawback is that it takes a big toll on performance: if Scrapy follows each link one after the other, it can't process them simultaneously as it usually does, which slows things down.
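For what it's worth, Scrapy's CONCURRENT_REQUESTS setting can cap how many requests run in parallel, but it only throttles the crawl globally; it does not guarantee the ordering you need here, so it is not a substitute for the pattern below:

# settings.py -- global throttle only, does not enforce request order
CONCURRENT_REQUESTS = 1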

The code could be something like:

def parse(self, response):
    sel = Selector(response)
    self.container = sel.css('div[MyDiv]')
    return self.increment(0)

# Function that will yield the request for player n°index
def increment(self, index):
    player = self.container[index]  # select current player
    # extract LINK and TITLE
    yield Request(LINK, meta={'Title': TITLE, 'index': index},
                  callback=self.parsePlayer)

def parsePlayer(self, response):
    player = PlayerItem()
    # extract DATE
    yield player

    if DATE == some_criteria:
        index = response.meta['index'] + 1
        # re-enter increment and yield its request so the next player is crawled
        for request in self.increment(index):
            yield request

That way Scrapy will get the main page, then the first player, then the second player, and so on, until it finds a date that doesn't fit the criteria. At that point no new request is yielded and the spider stops.

This gets a little more complex if you also have to increment the index of the main page (if there are n main pages, for example), but the idea stays the same.
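A rough sketch of that variant, keeping the same pattern (the next_main_page_url helper and the LINK/TITLE placeholders are hypothetical, as above):

def increment(self, index):
    if index >= len(self.container):
        # current main page exhausted: queue the next main page and restart
        yield Request(next_main_page_url(), callback=self.parse)
        return
    player = self.container[index]  # select current player
    # extract LINK and TITLE
    yield Request(LINK, meta={'Title': TITLE, 'index': index},
                  callback=self.parsePlayer)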

Robin
  • Thanks for your answer, I was thinking of exactly that way to do it. I have a question about the performance: if I do that, will I lose performance? Also, I do have many main pages :). I will try to support them myself and if I can't I will tell you – Marco Dinatsoli Feb 08 '14 at 05:12
  • `parsePlayer` should return the items to the pipeline, and it is not possible to do two returns. So...? – Marco Dinatsoli Feb 08 '14 at 05:23
  • @MarcoDinatsoli, I posted an answer, but then found that Robin actually answered it the same way. The only suggestion is in `parse` initially to yield not one link, but a chunk of links (e.g. 10). This way you would have 10 simultaneous requests to player pages at any time. – warvariuc Feb 08 '14 at 06:15
  • @warwaruk in Robin's answer, I can't pass the player information to the pipeline because in the parsePlayer callback he returns increment, not the player information, right? – Marco Dinatsoli Feb 08 '14 at 06:18
  • In scrapy a callback can yield either an item or a request: http://doc.scrapy.org/en/latest/topics/spiders.html#spiders – warvariuc Feb 08 '14 at 06:23
  • @warwaruk he is returning `increment(index)`, which is a function call. In other words, neither a request nor an item, right? – Marco Dinatsoli Feb 08 '14 at 06:28
  • @warwaruk I will be back in 20 minutes – Marco Dinatsoli Feb 08 '14 at 06:32
  • @warwaruk: Indeed, yielding `n` links each time is a pretty good way to restore some sort of performance! @MarcoDinatsoli: `return` on the `self.increment` function actually probably isn't necessary, I can't test on my current setup. And yes, as explained, it lowers performance because you treat one link at a time. Warwaruk's tweak allows treating multiple at a time, which will counter this drawback significantly. – Robin Feb 08 '14 at 14:47
  • @warwaruk check my answer below please – Marco Dinatsoli Feb 08 '14 at 19:25
2

Something like (based on Robin's answer):

class PlayerSpider(Spider):

    def __init__(self):
        self.player_urls = []
        self.done = False  # flag to know when a player with a birth date out of range is found

    def extract_player_urls(self, response):
        sel = Selector(response)
        # pseudocode: parse the player links out of the page
        self.player_urls.extend(extracted_player_links)

    def parse(self, response):
        self.extract_player_urls(response)
        for i in xrange(10):
            yield Request(self.player_urls.pop(), callback=self.parse_player)

    def parse_player(self, response):
        if self.done:
            return
        # ... extract player birth date
        if bd_date not in valid_range:  # pseudocode: birth date outside the 20-25 years range
            self.done = True
            # ... somehow clear downloader queue
            return

        # ... create and fill item
        yield item
        yield Request(self.player_urls.pop(), callback=self.parse_player)
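For the "somehow clear downloader queue" part, one option (a sketch, not necessarily the only way) is to raise Scrapy's CloseSpider exception from the callback; note that requests already in flight may still be processed before the spider shuts down:

# Sketch: stopping the crawl once an out-of-range birth date is found
from scrapy.exceptions import CloseSpider

def parse_player(self, response):
    # ... extract player birth date
    if bd_date not in valid_range:
        # stop scheduling new requests and close the spider;
        # requests already in flight may still complete
        raise CloseSpider('player birth date out of range')
    # ... create and fill the item, then yield it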
warvariuc
  • Thanks for your answer. Where is the date check, please? Also, would you please write the general algorithm of your solution? – Marco Dinatsoli Feb 08 '14 at 06:08
  • Should the `yield item` allow me to write to JSON? I mean, is it like returning the item? I do know that I can't use a return value and yield together, that is why I am asking – Marco Dinatsoli Feb 08 '14 at 11:14
  • You cannot `return something` in a generator. If you yield a `Request` it's scheduled to be downloaded. If you yield an `Item` - it is sent to the pipeline. Everything should work as in other cases. – warvariuc Feb 08 '14 at 11:19
2

First of all, I want to thank @warwaruk and @Robin for helping me with this issue.

And the best thanks to my great teacher @pault.

I found the solution and here is the algorithm:

  1. Start scraping the main page.
  2. Extract all the players' links.
  3. Issue a request with a callback for each player's link to extract his information. The request's meta includes: the number of players on the current main page and the position of the player that I want to scrape.
  4. In the callback for each player:

    4.1 Extract the player's information.

    4.2 Check if the date is in the range. If not: do nothing. If yes: check whether this is the last player in the main page's player list; if it is, yield a request for the next main page.

Simple code

def parse(self, response):
    currentPlayer = 0
    for player in Players:
        currentPlayer += 1
        yield Request(player.link,
                      meta={'currentPlayer': currentPlayer,
                            'numberOfPlayers': len(Players)},
                      callback=self.parsePlayer)

def parsePlayer(self, response):
    currentPlayer = response.meta['currentPlayer']
    numberOfPlayers = response.meta['numberOfPlayers']
    # extract player's information
    if player['date'] in valid_range:  # age between 20 and 25 years
        if currentPlayer == numberOfPlayers:
            yield Request(linkToNextMainPage, callback=self.parse)
            yield playerInformation  # in order to be written to the JSON file
        else:
            yield playerInformation

It works perfectly :)
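For what it's worth, the built-in feed export could also do the JSON writing without a custom pipeline: running something like `scrapy crawl <spider name> -o players.json` exports every item the callbacks yield (`players.json` is just an example file name).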

Marco Dinatsoli
  • 1) You can factorize the end of your code: `yield playerInfo \n if currentPlayer == numberOfPlayers: \n yield(linkToNextMainPage..)` 2) This actually doesn't answer your question, which mentioned only ONE main page. This is a big deal, because if you only need to stop at the _main page_ containing the first out-of-range link, you can yield all links from that page (which is quite a bit more straightforward and is what you're doing). Your question as we both understood it was about parsing one page and stopping **mid-page** as soon as you are out of range, so not the same issue. – Robin Feb 08 '14 at 20:56
  • The solution you found is therefore more suited for your needs, but please update your question to make it obvious for anyone landing here. – Robin Feb 08 '14 at 20:58
  • I think you could do without all this `meta` stuff, because it's messy and it's not reliable anyway: you request next main page when you get a callback an a player with the "last" index. But because the downloading is asynchronous, it doesn't mean that if you got the "last" player it's the last callback from the current main page. It could happen that the last player page was loaded faster then the previous. But if it works for you - it's ok. Glad you found a solution for you – warvariuc Feb 09 '14 at 06:05
  • @warwaruk even if the "last" player on the main page wasn't the last callback, it is still correct to crawl the next main page from the "last" player on the main page, because the players are listed depending on their dates. So it doesn't matter which callback handles the "last" player; what matters is whether the "last" player is in the range or not. Right? – Marco Dinatsoli Feb 09 '14 at 17:51
  • It just can make one or more unnecessary requests to other main pages. Anyway, when in doubt, tests should help :) – warvariuc Feb 09 '14 at 18:04
  • @warwaruk would you follow me please http://stackoverflow.com/questions/21662689/scrapy-run-spider-from-script?noredirect=1#comment32743011_21662689 – Marco Dinatsoli Feb 09 '14 at 18:08