
I am using a CrawlSpider that recursively follows links, calling the next page with a link-extraction rule like:

rules = (
    Rule(LinkExtractor(
            allow=(),
            restrict_xpaths=('//a[contains(.,"anextpage")]',)),
        callback='parse_method',
        follow=True),
)

I have applied this strategy to recursively crawl different websites, and as long as the anchor tag contained text, like <a href="somelink">sometext</a>, everything worked fine.

I am now trying to scrape a website that uses a

<div class="bui-pagination__item bui-pagination__next-arrow">
  <a class="pagenext" href="/url.html" aria-label="Pagina successiva">
    <svg class="bk-icon -iconset-navarrow_right bui-pagination__icon" height="18" role="presentation" width="18" viewBox="0 0 128 128">
      <path d="M54.3 96a4 4 0 0 1-2.8-6.8L76.7 64 51.5 38.8a4 4 0 0 1 5.7-5.6L88 64 57.2 94.8a4 4 0 0 1-2.9 1.2z"></path>
    </svg>
  </a>
</div>

as its 'next' button instead of plain text. My LinkExtractor rule no longer matches anything, and the spider stops after the first page.

I have tried to match on the svg element instead, but that does not seem to trigger the extraction either:

restrict_xpaths=('//a[contains(.,name()=svg) and contains(@class,"nextpageclass")]'))
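I can reproduce the mismatch with a quick check using only the standard library (the snippet below inlines a simplified copy of the pagination markup from above): the anchor element has no text at all, only an svg child, so a text-based contains() can never match it, while matching on an attribute such as class still works.

```python
# Sketch: parse a simplified copy of the pagination markup with the stdlib
# XML parser to see why a text-based XPath cannot match the anchor.
import xml.etree.ElementTree as ET

html = (
    '<div class="bui-pagination__item bui-pagination__next-arrow">'
    '<a class="pagenext" href="/url.html" aria-label="Pagina successiva">'
    '<svg height="18" role="presentation" width="18"></svg>'
    '</a></div>'
)

root = ET.fromstring(html)
anchor = root.find('.//a')

# The anchor carries no text of its own, only an <svg> child
# (on the real page there may be whitespace at most)...
print(repr(anchor.text))  # None

# ...so matching on its text can never succeed, but matching on
# an attribute such as class still finds the link:
print(root.find('.//a[@class="pagenext"]').get('href'))  # /url.html
```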

Is there anything I am missing?

user299791

1 Answer


That's most probably because the site builds its pagination with JavaScript. You may need to use Splash to render the page (or simulate the click) so the spider receives the pre-rendered HTML. This is a good place to start:

https://docs.scrapy.org/en/latest/topics/dynamic-content.html
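A sketch of what the Splash route might look like, assuming the scrapy-splash plugin and a running Splash instance; the spider name, start URL, and the pagenext class selector are illustrative, not taken from the question:

```python
# Hypothetical sketch using scrapy-splash; requires a Splash server and,
# in settings.py, something like:
#   SPLASH_URL = 'http://localhost:8050'
#   DOWNLOADER_MIDDLEWARES = {
#       'scrapy_splash.SplashCookiesMiddleware': 723,
#       'scrapy_splash.SplashMiddleware': 725,
#   }
import scrapy
from scrapy_splash import SplashRequest

class NextPageSpider(scrapy.Spider):
    name = 'nextpage'

    def start_requests(self):
        # Render the first page in Splash so that JavaScript-built
        # pagination links exist in the HTML Scrapy sees.
        yield SplashRequest('https://example.com/list', self.parse,
                            args={'wait': 0.5})

    def parse(self, response):
        # ... extract items here ...
        # Follow the "next" arrow by its class, not by its (empty) text:
        next_href = response.xpath(
            '//a[contains(@class,"pagenext")]/@href').get()
        if next_href:
            yield SplashRequest(response.urljoin(next_href), self.parse,
                                args={'wait': 0.5})
```

Note that this replaces the CrawlSpider rules with explicit requests, since LinkExtractor rules only see the raw (unrendered) response by default.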