I am using a CrawlSpider that recursively follows links, requesting the next page through a link-extraction rule like:
rules = (
    Rule(
        LinkExtractor(
            allow=(),
            restrict_xpaths='//a[contains(., "anextpage")]',
        ),
        callback='parse_method',
        follow=True,
    ),
)
I have applied this strategy to recursively crawl different websites, and as long as there was text inside the anchor tag, like <a href="somelink">sometext</a>, everything worked fine.
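That behaviour matches how XPath's `contains(., "…")` works: the `.` is the element's string value, i.e. its concatenated text content. A quick sanity check with lxml (which Scrapy's selectors use under the hood; the markup below is just an illustrative snippet):

```python
from lxml import etree

# contains(., "sometext") tests the <a>'s string value, i.e. its text content.
doc = etree.fromstring('<p><a href="somelink">sometext</a></p>')
print(doc.xpath('//a[contains(., "sometext")]'))  # one match: the text link
```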
I am now trying to scrape a website that uses
<div class="bui-pagination__item bui-pagination__next-arrow">
<a class="pagenext" href="/url.html" aria-label="Pagina successiva">
<svg class="bk-icon -iconset-navarrow_right bui-pagination__icon" height="18" role="presentation" width="18" viewBox="0 0 128 128">
<path d="M54.3 96a4 4 0 0 1-2.8-6.8L76.7 64 51.5 38.8a4 4 0 0 1 5.7-5.6L88 64 57.2 94.8a4 4 0 0 1-2.9 1.2z"></path>
</svg>
</a>
</div>
as a 'next' button instead of plain text. My LinkExtractor rule no longer matches anything, and the spider stops after the first page.
I have tried to match on the svg child element, but that does not seem to trigger the extraction either:
restrict_xpaths=('//a[contains(.,name()=svg) and contains(@class,"nextpageclass")]'))
Is there anything I am missing?
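One thing I suspect (hedging, since I may be misreading the XPath spec): `contains(., "anextpage")` can never match an anchor whose only child is an `<svg>`, because the anchor's string value is empty; and in `contains(., name()=svg)` the unquoted `svg` is parsed as a path expression, not the string "svg", so the predicate never performs the name test I intended. Assuming the class names in the snippet above are stable, one alternative is to restrict on the anchor's attributes or its parent instead of its text. A sketch with lxml showing both behaviours:

```python
from lxml import etree

html = '''
<div class="bui-pagination__item bui-pagination__next-arrow">
  <a class="pagenext" href="/url.html" aria-label="Pagina successiva">
    <svg height="18" role="presentation" width="18"></svg>
  </a>
</div>
'''

root = etree.fromstring(html, etree.HTMLParser())

# Fails: the <a> wraps only an <svg>, so its string value is empty.
print(root.xpath('//a[contains(., "anextpage")]'))  # []

# Works: match on the anchor's own class attribute instead of its text.
print(root.xpath('//a[contains(@class, "pagenext")]'))  # one match

# Alternative: anchor the match on the pagination wrapper's class.
print(root.xpath('//div[contains(@class, "next-arrow")]/a'))  # one match
```

If that selects the link, I would expect the same expression to work as `restrict_xpaths='//a[contains(@class, "pagenext")]'` in the Rule, since LinkExtractor only needs an XPath selecting the region that contains the `<a>`, not its text.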