I want to get the HTML page from https://www.collinsdictionary.com/dictionary/english/supremacy, but part of the page is loaded by JavaScript. When I fetch the page with HTTP.request() from HTTP.jl, I only get the HTML that exists before the JavaScript has run, so the result differs from what Chrome shows. How can I get the same page that Chrome gets? Do I have to use WebDriver.jl, which is a wrapper around Selenium WebDriver's Python bindings?

Part of my source:

function get_page(w::word)::Bool
    response = nothing
    try
        response = HTTP.request("GET", "https://www.collinsdictionary.com/dictionary/$(dictionary)/$(w.org_word)",
                                                 connect_timeout=connect_timeout, readtimeout=readtimeout, retries=retries, redirect=true,proxy=proxy)
    catch e
        push!(w.err_log, [get_page_http_err, string(e)])
        return false
    end
    open("./assets/org_page.html", "w") do f 
        write(f, String(response.body))
    end
    return true
end

`dictionary` and `w.org_word` are both of type String; the function is defined inside a module.

xinyu
  • Could you post your code? Have you tried simply adding a sleep or a `waitFor` call to wait for the content to load in your scraper? In other words, your program might be taking the HTML before it is fully generated, so you need to explicitly wait for the website to load. – Granitosaurus Oct 10 '21 at 06:45
  • @Granitosaurus Thank you for your comment! I have posted my code. – xinyu Oct 10 '21 at 06:58

1 Answer

What you want is impossible to achieve with HTTP.jl alone. HTTP.jl only downloads what the server sends; running the JavaScript part of the page is a fundamentally different task. You need a JavaScript engine (in practice, a browser) to do so, which is no small thing.

This is not a weakness unique to Julia's HTTP.jl. The same problem comes up in Python; see the question "Python requests.get(url) returning javascript code instead of the page html".

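(In Python, JavaScript rendering seems to be available through the separate requests-html package rather than through requests itself, which is a third-party library and not part of the standard library.)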
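
If pulling in WebDriver.jl feels too heavy, one possible alternative is to let a headless Chrome do the rendering and read its output from Julia. The sketch below is only an illustration under some assumptions: a Chrome or Chromium binary is installed and reachable as `google-chrome` (the name varies by system), and the function names `rendered_page_cmd` and `get_rendered_page` are made up for this example. Chrome's `--dump-dom` flag prints the DOM after the page has loaded, so content that scripts insert much later may still be missing.

```julia
# Build the command that asks headless Chrome to print the rendered DOM.
# `chrome` is an assumed binary name; adjust it for your system
# (for example "chromium" or a full path).
function rendered_page_cmd(url::AbstractString; chrome::AbstractString="google-chrome")
    return `$chrome --headless --disable-gpu --dump-dom $url`
end

# Run the command and return the rendered HTML as a String.
# Throws if the binary cannot be found or Chrome exits with an error.
function get_rendered_page(url::AbstractString; chrome::AbstractString="google-chrome")
    return read(rendered_page_cmd(url; chrome=chrome), String)
end
```

Inside get_page from the question, the HTTP.request call could then be swapped for something like `get_rendered_page("https://www.collinsdictionary.com/dictionary/$(dictionary)/$(w.org_word)")`, keeping the same try/catch around it. Timeouts and proxy settings would have to be passed to Chrome separately, since HTTP.jl's keyword arguments no longer apply.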

jling