I'm trying to extract links with LinkExtractor from a page that uses infinite scroll. Doing this with
rules = (
    Rule(LinkExtractor(allow=".*?(\/nl\/agenda\/).*"), callback='parse_item', follow=True),
)
works. However, the pages are fetched without executing JavaScript, so the images (and their URLs, which I need) are not loaded. When I change the rule to:
rules = (
    Rule(LinkExtractor(allow=".*?(\/nl\/agenda\/).*"), callback='parse_item', follow=True, process_links='process_links'),
)
with:
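def process_links(self, links):
    # Route every extracted link through the local Splash render.html endpoint.
    # Needs: from urllib.parse import urlencode (Python 3; plain urllib in Python 2)
    for link in links:
        link.url = "http://localhost:8050/render.html?" + urlencode({'url': link.url})
    return links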
it only follows the URLs that are present when the page first loads (but I need ALL the links, including the ones that only appear after scrolling). For some reason it also requests some odd localhost URLs like this:
http://localhost:8050/render.html?url=http%3A%2F%2Flocalhost%3A8050%2Fnl%2Fagenda%2xxxxxx
I don't understand why it does that.
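One plausible cause, sketched in plain Python under stated assumptions: once a page has been fetched through Splash, the response URL is the `render.html` URL, so root-relative links on that page would be resolved against `localhost:8050` and then wrapped a second time by `process_links`. The host `example.com` and the path `some-event` are hypothetical placeholders, not values from the real site.

```python
from urllib.parse import urlencode, urljoin

SPLASH = "http://localhost:8050/render.html?"


def wrap(url):
    """Wrap a URL the same way process_links does."""
    return SPLASH + urlencode({"url": url})


# URL of a page that was already fetched through Splash (hypothetical site).
page = wrap("http://example.com/nl/agenda/")

# A root-relative href found on that page is resolved against the Splash URL,
# so the host becomes localhost:8050 instead of the original site.
resolved = urljoin(page, "/nl/agenda/some-event")

# process_links then wraps the already-wrong URL once more.
wrapped_again = wrap(resolved)

print(resolved)
print(wrapped_again)
```

If this is indeed the cause, Splash's `baseurl` argument for `render.html`, or skipping links that already point at `localhost:8050`, might be worth trying. I have not verified either against this site.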
Is there a way to execute JavaScript when using the LinkExtractor and Splash, so that I can scroll and load all the links before the LinkExtractor extracts them? Executing JavaScript only when following the links from the LinkExtractor would also be enough, but I don't know where to begin.
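To make the goal concrete, here is a minimal sketch of the kind of request that might do the scrolling. It assumes Splash's `execute` endpoint with a Lua script passed as `lua_source`. The helper name `splash_scroll_url`, the `scrolls` and `wait` arguments, and their default values are made up for illustration, and the script has not been tested against the real page.

```python
from urllib.parse import urlencode

SPLASH_EXECUTE = "http://localhost:8050/execute?"

# Lua script run by Splash: load the page, scroll to the bottom a fixed
# number of times (waiting after each scroll), then return the final HTML.
SCROLL_SCRIPT = """
function main(splash)
  assert(splash:go(splash.args.url))
  for _ = 1, tonumber(splash.args.scrolls) do
    splash:runjs("window.scrollTo(0, document.body.scrollHeight);")
    splash:wait(tonumber(splash.args.wait))
  end
  splash:set_result_content_type("text/html; charset=utf-8")
  return splash:html()
end
"""


def splash_scroll_url(url, scrolls=10, wait=1.0):
    """Build an execute-endpoint URL that scrolls `url` before returning it.

    `scrolls` and `wait` are hypothetical knobs read by the script above.
    """
    return SPLASH_EXECUTE + urlencode({
        "url": url,
        "lua_source": SCROLL_SCRIPT,
        "scrolls": scrolls,
        "wait": wait,
    })
```

Such a URL could presumably be used for the start URL, or inside `process_links` in place of the `render.html` one. A fixed scroll count is a guess, and a real page may need a different stopping condition.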