
I'm trying to extract links with LinkExtractor from a page that uses infinite scroll. Doing this with

    rules = (
        Rule(LinkExtractor(allow=".*?(\/nl\/agenda\/).*"), callback='parse_item', follow=True),
    )

works. However, the pages are fetched without JavaScript, so the images (and their URLs, which I need) never load. When I change the rule to:

    rules = (
        Rule(LinkExtractor(allow=".*?(\/nl\/agenda\/).*"), callback='parse_item', follow=True, process_links='process_links'),
    )

with:

    def process_links(self, links):
        for link in links:
            link.url = "http://localhost:8050/render.html?" + urlencode({ 'url' : link.url })
        return links

With this, the crawler only follows the URLs that are present when the page first loads (but I need ALL the links, including those that appear only after scrolling). For some reason it also requests some odd localhost URLs, such as:

    http://localhost:8050/render.html?url=http%3A%2F%2Flocalhost%3A8050%2Fnl%2Fagenda%2xxxxxx

I have no clue why it does that.

Is there a way to execute JavaScript when using LinkExtractor with Splash, so that I can scroll and load all the links before LinkExtractor collects them? Executing JavaScript only when following the links that LinkExtractor returns would also be enough, but I don't know where to begin with that.

Mees Kluivers
  • See https://stackoverflow.com/questions/8550114/can-scrapy-be-used-to-scrape-dynamic-content-from-websites-that-are-using-ajax and https://docs.scrapy.org/en/master/topics/dynamic-content.html – Gallaecio Jul 05 '19 at 13:47
  • Does this answer your question? [how does scrapy-splash handle infinite scrolling?](https://stackoverflow.com/questions/40325657/how-does-scrapy-splash-handle-infinite-scrolling) – Gallaecio Nov 20 '19 at 09:13

1 Answer


LinkExtractor works on the content of the response it receives, not on content that is added dynamically afterwards. You are right to use Splash for rendering JavaScript, but a plain render.html request does not scroll the page, so the extra items are never loaded. Infinite scrolling is usually just a network call that fetches the next batch of data and appends it to the existing HTML. So scroll the page with the browser's network tab open, find that call, and request it directly to get the data you need.
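
As a rough illustration of requesting the scroll call directly, here is a minimal sketch. Everything site-specific in it is an assumption: the endpoint `https://example.com/api/agenda`, the `page` and `per_page` parameters, and the `{"items": [{"url": ...}]}` payload shape are all hypothetical and have to be replaced with whatever the network tab shows for the real site. The `fetch` argument stands in for whatever actually downloads a URL.

```python
import json
from urllib.parse import urlencode, urljoin

# Hypothetical values: read the real endpoint and parameter names from the
# browser's network tab while scrolling the page.
API_ENDPOINT = "https://example.com/api/agenda"
SITE_ROOT = "https://example.com"


def page_url(page, per_page=20):
    """Build the URL for one batch of results, mimicking the scroll request."""
    return API_ENDPOINT + "?" + urlencode({"page": page, "per_page": per_page})


def extract_links(body):
    """Return absolute item URLs from one JSON batch.

    Assumes a payload shaped like {"items": [{"url": "/nl/agenda/..."}]}.
    """
    payload = json.loads(body)
    return [urljoin(SITE_ROOT, item["url"]) for item in payload.get("items", [])]


def crawl(fetch, max_pages=50):
    """Request batches until one comes back empty.

    fetch(url) must return the response body as text.
    """
    links = []
    for page in range(1, max_pages + 1):
        batch = extract_links(fetch(page_url(page)))
        if not batch:
            break  # no more data: the "scroll" has reached the end
        links.extend(batch)
    return links
```

In a spider, each `page_url(...)` would be yielded as a `scrapy.Request` and `extract_links(response.text)` called in its callback. The resulting item URLs can then be sent through Splash if the detail pages still need JavaScript for their images.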

ThunderMind