8

I want to reverse-engineer the content that is loaded when scrolling down a webpage. The problem is the URL https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933: the screwrand parameter doesn't seem to follow any pattern, so reconstructing the URLs doesn't work. I'm considering rendering the page automatically with Splash instead. How can I make Splash scroll the way a browser does? Thanks a lot! Here is the code for the two requests:

request1 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following/{}'.format(user_id),
    self.parse_follow_relationship,
    args={'wait': 2},
    meta={'user_id': user_id, 'action': 'following'},
    endpoint='http://192.168.99.100:8050/render.html')

yield request1

request2 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following_user/80159?user_id=80159&limit=0&per_page=20&screwrand=76',
    self.parse_tmp,
    meta={'user_id':user_id, 'action':'following'},
    endpoint='http://192.168.99.100:8050/render.html')

yield request2

[Image: AJAX request shown in the browser console]

Bowen Liu

3 Answers

21

To scroll a page you can write a custom rendering script (see http://splash.readthedocs.io/en/stable/scripting-tutorial.html), something like this:

function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end        
    return splash:html()
end

To run this script, use the 'execute' endpoint instead of the render.html endpoint:

script = """<Lua script> """
scrapy_splash.SplashRequest(url, self.parse,
                            endpoint='execute', 
                            args={'wait':2, 'lua_source': script}, ...)
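As a rough sketch of where the Lua script lives in a Python file: it is kept as an ordinary string and shipped to Splash through the `lua_source` argument. The helper name `scroll_args` and the extra argument names `num_scrolls` and `scroll_delay` are assumptions made for illustration; the Lua side reads them from `splash.args`, the same way the script above reads `splash.args.wait`.

```python
# The Lua script is just a Python string; Splash executes it, Python only sends it.
SCROLL_SCRIPT = """
function main(splash)
    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, splash.args.num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(splash.args.scroll_delay)
    end
    return splash:html()
end
"""


def scroll_args(wait=2, num_scrolls=10, scroll_delay=1.0):
    """Build the `args` dict for a SplashRequest sent to the 'execute' endpoint."""
    return {
        "lua_source": SCROLL_SCRIPT,
        "wait": wait,
        # Hypothetical argument names, read in Lua as splash.args.<name>.
        "num_scrolls": num_scrolls,
        "scroll_delay": scroll_delay,
    }


# Inside a spider callback (needs scrapy_splash and a running Splash instance):
#     yield scrapy_splash.SplashRequest(url, self.parse,
#                                       endpoint='execute', args=scroll_args())
```

Passing the scroll count and delay as arguments rather than hard-coding them in Lua lets each request choose its own values without editing the script.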
Mikhail Korobov
  • 1
    Can you please explain where to write this script? I am confused about how to write this JavaScript function in a Python file. – Raheel Jul 24 '17 at 08:23
  • 1
    If this script reaches the end and then some javascript appends new content to the page, will the script scroll again and again until no more content is added? – Milos Jan 12 '18 at 01:26
4

Thanks Mikhail, I tried your scroll script and it worked, but I noticed that it scrolls too far in one step: some JavaScript has no time to render and gets skipped. So I made a small change, as follows:

function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        local height = get_body_height()
        for i = 1, 10 do
            scroll_to(0, height * i/10)
            splash:wait(scroll_delay/10)
        end
    end
    return splash:html()
end
李东勇
0

I don't think hard-coding the number of scrolls is a good idea for infinite-scroll pages, so I modified the code above like this:

function main(splash, args)
    local current_scroll = 0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(3)

    local height = get_body_height()

    -- Keep scrolling to the bottom until the page height stops growing.
    while current_scroll < height do
        scroll_to(0, get_body_height())
        splash:wait(5)
        current_scroll = height
        height = get_body_height()
    end
    splash:set_viewport_full()
    return splash:html()
end
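The termination logic of that loop can be modelled in plain Python to see when it stops. This is only a sketch of the idea, not Splash code: `scroll_until_stable`, `FakePage`, and the `max_scrolls` safety cap are all hypothetical names introduced here, and the cap is an addition that the Lua script above does not have.

```python
def scroll_until_stable(get_height, scroll_to, wait, max_scrolls=50):
    """Scroll to the bottom until the height stops growing or the cap is hit.

    Returns the number of scrolls performed.
    """
    previous = 0
    height = get_height()
    scrolls = 0
    while previous < height and scrolls < max_scrolls:
        scroll_to(0, height)
        wait()                    # give the page time to append new content
        previous = height
        height = get_height()     # unchanged height ends the loop
        scrolls += 1
    return scrolls


class FakePage:
    """Hypothetical stand-in for a page that loads a fixed number of batches."""

    def __init__(self, batches, batch_height=1000):
        self.remaining = batches
        self.batch_height = batch_height
        self.height = batch_height

    def get_height(self):
        return self.height

    def scroll_to(self, x, y):
        # Reaching the bottom triggers one more batch while any remain.
        if y >= self.height and self.remaining > 0:
            self.remaining -= 1
            self.height += self.batch_height


page = FakePage(batches=3)
count = scroll_until_stable(page.get_height, page.scroll_to, lambda: None)
```

In this model the loop performs one scroll per loaded batch plus a final scroll that confirms nothing new arrived. On a real page, a wait that is shorter than the content's load time would end the loop early, which is why the Lua version waits several seconds per scroll.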