
I am using a CrawlSpider that recursively follows links, calling the next page with a link-extraction rule like:

rules = (
    Rule(LinkExtractor(
            allow=(),
            restrict_xpaths=('//a[contains(.,"anextpage")]',)),
        callback='parse_method',
        follow=True),
)

I have applied this strategy to recursively crawl different websites, and as long as the anchor tag contained text, like <a href="somelink">sometext</a>, everything worked fine.

I am now trying to scrape a website that uses a

<div class="bui-pagination__item bui-pagination__next-arrow">
  <a class="pagenext" href="/url.html" aria-label="Pagina successiva">
    <svg class="bk-icon -iconset-navarrow_right bui-pagination__icon" height="18" role="presentation" width="18" viewBox="0 0 128 128">
      <path d="M54.3 96a4 4 0 0 1-2.8-6.8L76.7 64 51.5 38.8a4 4 0 0 1 5.7-5.6L88 64 57.2 94.8a4 4 0 0 1-2.9 1.2z"></path>
    </svg>
  </a>
</div>

as its 'next' button instead of plain text. My LinkExtractor rule no longer matches anything, and the spider stops after the first page.

I have tried to match on the svg element instead, but that does not seem to trigger the extraction either:

restrict_xpaths=('//a[contains(.,name()=svg) and contains(@class,"nextpageclass")]'))
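I can reproduce the mismatch with a quick check using only the standard library (the snippet below inlines a simplified copy of the pagination markup from above): the anchor element has no text at all, only an svg child, so a text-based contains() can never match it, while matching on an attribute such as class still works.

```python
# Sketch: parse a simplified copy of the pagination markup with the stdlib
# XML parser to see why a text-based XPath cannot match the anchor.
import xml.etree.ElementTree as ET

html = (
    '<div class="bui-pagination__item bui-pagination__next-arrow">'
    '<a class="pagenext" href="/url.html" aria-label="Pagina successiva">'
    '<svg height="18" role="presentation" width="18"></svg>'
    '</a></div>'
)

root = ET.fromstring(html)
anchor = root.find('.//a')

# The anchor carries no text of its own, only an <svg> child
# (on the real page there may be whitespace at most)...
print(repr(anchor.text))  # None

# ...so matching on its text can never succeed, but matching on
# an attribute such as class still finds the link:
print(root.find('.//a[@class="pagenext"]').get('href'))  # /url.html
```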

Is there anything I am missing?

user299791

1 Answer


That's most probably because the site builds its pagination with JavaScript. You may need to use Splash to render the page (or simulate the click) so the spider receives the pre-rendered HTML. This is a good place to start:

https://docs.scrapy.org/en/latest/topics/dynamic-content.html
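A sketch of what the Splash route might look like, assuming the scrapy-splash plugin and a running Splash instance; the spider name, start URL, and the pagenext class selector are illustrative, not taken from the question:

```python
# Hypothetical sketch using scrapy-splash; requires a Splash server and,
# in settings.py, something like:
#   SPLASH_URL = 'http://localhost:8050'
#   DOWNLOADER_MIDDLEWARES = {
#       'scrapy_splash.SplashCookiesMiddleware': 723,
#       'scrapy_splash.SplashMiddleware': 725,
#   }
import scrapy
from scrapy_splash import SplashRequest

class NextPageSpider(scrapy.Spider):
    name = 'nextpage'

    def start_requests(self):
        # Render the first page in Splash so that JavaScript-built
        # pagination links exist in the HTML Scrapy sees.
        yield SplashRequest('https://example.com/list', self.parse,
                            args={'wait': 0.5})

    def parse(self, response):
        # ... extract items here ...
        # Follow the "next" arrow by its class, not by its (empty) text:
        next_href = response.xpath(
            '//a[contains(@class,"pagenext")]/@href').get()
        if next_href:
            yield SplashRequest(response.urljoin(next_href), self.parse,
                                args={'wait': 0.5})
```

Note that this replaces the CrawlSpider rules with explicit requests, since LinkExtractor rules only see the raw (unrendered) response by default.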