I am new to Scrapy and want to crawl web pages. Before starting a complete project, I was exploring the command line. From the crawled page, I was able to extract the links under the h3 tags with the command below:

sel.xpath("//h3//@href").extract()

This extracted all the URLs. Later I realised that the links on the website are paginated. I could find the total number of pages by going through the pages manually, but I would rather extract it from the first page, which shows this information at the bottom:

Page 1 of 100

inside a div tag:

<div class="pagination-meta">
    Page 1 of 100
</div>

I tried the following command to extract it, but it returned only []. Please correct me if I am wrong:

sel.xpath('//div[@class="pagination_meta"]/text()').extract();

I then tried the following, since the pagination-meta div is nested inside two other divs:

<div class="search-pagination-top bb box-sizing-content">
    <div class="grid_3 column alpha tmargin">
        <div class="pagination-meta">
        Page 1 of 100
        </div>
    </div>
</div>


sel.xpath('//div[@class="search-pagination-top bb box-sizing-content"]//div/text()').extract();
    [u'Page 1 of 100']

Is this the correct way to do it? Why didn't my first command return the content?

balaaagi
  • Use [FirePath](https://addons.mozilla.org/en-US/firefox/addon/firepath/) extension for Firefox to debug your xpath expressions, but keep in mind, that some of them will differ in scrapy, because Firefox and other browsers may change the page structure (e.g. add `tbody` tags to tables) – warvariuc Jun 15 '14 at 09:38
  • Also, from my experience I have never needed the total page count. I prefer always to find link to the next page. – warvariuc Jun 15 '14 at 09:39
  • @warwaruk so how will I crawl all the paginated pages? – balaaagi Jun 21 '14 at 01:08
  • See [here](http://stackoverflow.com/questions/6591255/following-links-scrapy-web-crawler-framework/6593158#6593158) the code with `nextPageLink` – warvariuc Jun 21 '14 at 04:29

1 Answer

It will work if you use:

sel.xpath('//div[@class="pagination-meta"]/text()').extract();

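The predicate `@class="..."` compares against the exact attribute string, so an underscore in place of a dash is enough to make the match fail. Your first command looked for `pagination_meta`, but the class in the page is `pagination-meta`.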
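To see the difference without a Scrapy shell, here is a minimal sketch that runs the same two predicates against the markup from the question. It uses Python's standard-library `xml.etree.ElementTree` as a stand-in for Scrapy's selector (an assumption made only so the snippet has no dependencies); the helper name `div_texts` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question; it is well-formed, so ElementTree can parse it.
HTML = """
<div class="search-pagination-top bb box-sizing-content">
    <div class="grid_3 column alpha tmargin">
        <div class="pagination-meta">
        Page 1 of 100
        </div>
    </div>
</div>
""".strip()

root = ET.fromstring(HTML)


def div_texts(class_name):
    """Return the stripped text of every div whose class equals class_name exactly."""
    path = ".//div[@class='{}']".format(class_name)
    return [(el.text or "").strip() for el in root.findall(path)]


underscore_result = div_texts("pagination_meta")  # the class used in the first attempt
dash_result = div_texts("pagination-meta")        # the class actually present in the page

print(underscore_result)  # []
print(dash_result)        # ['Page 1 of 100']
```

In Scrapy itself, the equivalent of the second call is the corrected `sel.xpath(...)` expression shown above.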

There are many ways to reach the same result, and your second approach is also correct. It is often useful to first select a context node with one or more location steps and then navigate to the final selection with a relative XPath expression. That helps when the page content or its structure may change.
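As a rough illustration of that context-then-relative pattern, the sketch below first selects the outer container, then searches its descendant divs with a relative path and pulls the total page count out of the "Page 1 of 100" text with a regular expression. It again uses the standard-library `ElementTree` rather than Scrapy, and the wrapping `<body>` element, the function name `total_pages`, and the regular expression are assumptions for the example, not something taken from the real site.

```python
import re
import xml.etree.ElementTree as ET

# The question's markup, wrapped in a hypothetical <body> so the container is a descendant.
PAGE = """
<body>
  <div class="search-pagination-top bb box-sizing-content">
    <div class="grid_3 column alpha tmargin">
      <div class="pagination-meta">
      Page 1 of 100
      </div>
    </div>
  </div>
</body>
""".strip()


def total_pages(markup):
    """Return the total page count from a 'Page X of Y' label, or None if absent."""
    root = ET.fromstring(markup)
    # Step 1: select the context node.
    container = root.find(
        ".//div[@class='search-pagination-top bb box-sizing-content']"
    )
    if container is None:
        return None
    # Step 2: search relative to that context node.
    for div in container.findall(".//div"):
        match = re.search(r"Page\s+(\d+)\s+of\s+(\d+)", div.text or "")
        if match:
            return int(match.group(2))
    return None


pages = total_pages(PAGE)
print(pages)  # 100
```

With a Scrapy selector, the same two steps would be an `xpath()` call for the container followed by a relative `xpath()` call on the result.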

helderdarocha