I don't understand how to paginate when scraping with Kimono if the navigation has no "next >" link, e.g. for this paging structure:

<div class="pages" style="clear: both;">
    <span>1</span>    
    <a href="/page=2">2</a>
    <a href="/page=3">3</a>
    <a href="/page=4">4</a>
</div>

The XPath for this CSS selector gives results only for page 2:

div.pages > a

I want a single API (i.e. I don't want to generate a URL list with an additional API).

AndriuZ
  • What exactly are you trying to achieve? As far as I remember, you can use **URL generator** in Kimono or supply custom links you want API to use. By the way, the query you use `div.pages > a` is *css selector* and not *XPath*. – Gabrielius Oct 30 '15 at 16:45
  • a) My problem (as stated in the question) is to have one API with an appropriate CSS selector or XPath (i.e. I don't want to use a generated URL list with an additional API, because it causes additional problems). b) By the way, I can agree that [div.pages] is probably a _css selector_, but [div.pages > a] isn't. – AndriuZ Oct 31 '15 at 14:01
  • Unfortunately, `div.pages > a` is a *css selector*, which selects all `a` elements that are children of `div.pages` (take a look at [css selectors](http://www.w3schools.com/cssref/sel_element_gt.asp)). *XPath* syntax is different ([examples](http://www.w3schools.com/xsl/xpath_syntax.asp)). If you are trying to *page* and *scrape* in the same step, that's impossible to do. However, as I told you, you can generate URLs you need and use *one* API, by choosing *CRAWL STRATEGY: Generated URL list*. – Gabrielius Nov 02 '15 at 11:36
  • thanks for the link @Gabrielius, now I agree – AndriuZ Nov 02 '15 at 18:12

2 Answers

You have two options.

(a) Try `div.pages > span + a`. This 'next page' selector always selects the link to the next page and matches nothing on the last page, so crawling stops there. The example markup shows that the currently selected page is a `span` and the next page link is the `a` immediately after it. The adjacent sibling combinator `+` selects an `a` that directly follows a `span`. Note: you didn't provide a link to the target site, so this is not guaranteed to work, but based on your example markup it should.

(b) Simply enter a list of URLs manually for this API to crawl. It looks like the list you'd want is:

http://www.thissiteurl.com/page=1
http://www.thissiteurl.com/page=2
http://www.thissiteurl.com/page=3
...
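
Both options can be checked outside Kimono. Below is a minimal Python sketch, not Kimono's own implementation: `next_page_href` emulates what `div.pages > span + a` picks from the sample markup in the question, and `page_urls` builds the list from option (b). The names `NextPageFinder`, `next_page_href` and `page_urls` are made up for illustration, and the parser assumes simple markup with no void elements (such as `<br>`) inside `div.pages`.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
SAMPLE = """
<div class="pages" style="clear: both;">
    <span>1</span>
    <a href="/page=2">2</a>
    <a href="/page=3">3</a>
    <a href="/page=4">4</a>
</div>
"""


class NextPageFinder(HTMLParser):
    """Emulates `div.pages > span + a`: the <a> directly after the <span>."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside div.pages (0 = outside)
        self.prev_child = None  # tag name of the previous direct child
        self.next_href = None   # href of the first match, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Enter the paging container when its class list contains "pages".
            if tag == "div" and "pages" in (attrs.get("class") or "").split():
                self.depth = 1
                self.prev_child = None
            return
        if self.depth == 1:  # this tag is a direct child of div.pages
            if tag == "a" and self.prev_child == "span" and self.next_href is None:
                self.next_href = attrs.get("href")
            self.prev_child = tag
        self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1


def next_page_href(html):
    """Return the href of the 'next page' link, or None on the last page."""
    finder = NextPageFinder()
    finder.feed(html)
    return finder.next_href


def page_urls(base, first, last):
    """Option (b): build the URL list for pages first..last inclusive."""
    return [f"{base}/page={n}" for n in range(first, last + 1)]


print(next_page_href(SAMPLE))
print(page_urls("http://www.thissiteurl.com", 1, 3))
```

On the last page the `span` has no following `a`, so `next_page_href` returns `None`, which matches the "stops on the last page" behaviour described in (a).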
trip41
  • This is a brilliant idea in general, but it did not work for a [specific page](http://www.moreinspiration.com/Search?t=advertising&sort=addedon&page=2): all `a` elements remain in their positions. Is it possible to do this incrementally some other way? – AndriuZ Nov 02 '15 at 18:08
  • is there anything special about the element holding the 'current' page? a special `class`? special `id`? anything? – trip41 Nov 02 '15 at 23:50
  • @AndriusZ, I don't quite understand the point in selecting `a` elements if you don't use them as an API for source URLs, could you explain? Also, here is what I meant about [Generating URLs](https://help.kimonolabs.com/hc/en-us/articles/203257724-Generate-a-list-of-URLs-to-crawl-based-on-URL-parameters) - you can easily set `page` values as `from` and `to` range, avoiding the necessity to enter the links manually. – Gabrielius Nov 03 '15 at 09:11
  • @trip41 the link to the specific page is provided in my first comment. As far as I can see, your suggestion to use the first `a` after a `span` is good, but unfortunately it works only for the first next page – AndriuZ Nov 04 '15 at 20:07
  • @Gabrielius I know that way, but I want to learn the most elegant way to replicate the whole linked structure in as few steps as possible. So **my point in asking how not to use _second-level_ generated URLs** is simple: I'm already using a _first-level list_ of generated URLs, and I don't want to deal with a cumbersome nested structure and a lot of cross-referenced pages. – AndriuZ Nov 04 '15 at 20:18
  • @trip41 (a) it works!!! I finally found the reason the other pages were not being parsed: it was my fault for not resetting `/^()(3)()$/` – AndriuZ Nov 06 '15 at 07:15

Below are an XPath expression and a CSS selector that select all `a` elements used for paging:

  • XPath: //descendant::*[1]/a[contains(@href, 'page=')]

  • CSS selector: div[id=results] div[class~=pull-right] a

div[class~=pull-right] selects every div whose class attribute contains pull-right as one of its whitespace-separated words.

I don't much like this CSS selector, but for some reason Kimono does not allow the a[href] type of selection. Ideally you would use something like this:

  • Better CSS selector: div[id=results] a[href=~page]
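
The attribute operators are easy to mix up, so here is a small Python sketch of how standard CSS defines two of them: `~=` matches a whole whitespace-separated word, while `*=` matches a substring. The helper names `matches_word` and `matches_substring` and the class value `col-md-6 pull-right` are made up for illustration, and whether Kimono accepts `*=` is not confirmed here.

```python
def matches_word(attr_value, word):
    """[attr~=word]: true when `word` is one of the whitespace-separated tokens."""
    return word in attr_value.split()


def matches_substring(attr_value, text):
    """[attr*=text]: true when non-empty `text` occurs anywhere in the value."""
    return text != "" and text in attr_value


# Hypothetical class attribute: `~=` finds the token among several classes.
print(matches_word("col-md-6 pull-right", "pull-right"))

# The href from the page linked in the comments is a single token,
# so a word match on "page" fails while a substring match succeeds.
href = "/Search?t=advertising&sort=addedon&page=2"
print(matches_word(href, "page"))
print(matches_substring(href, "page"))
```

Under these rules a paging `href` would need the substring form, `a[href*=page]`, since `page` never appears in it as a standalone word.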
Gabrielius
  • Have you tried this yourself in Kimono? The results for both CSS selectors are wrong (750 / 30 rows instead of 118), and the second one also contains an error (it must be `href~=page`) – AndriuZ Nov 06 '15 at 07:09