5

When I make a request to https://stackoverflow.com using the requests library:

import requests

page = requests.get(url='https://stackoverflow.com')
print(page.content)

I get the following:

<!DOCTYPE html>
    <html class="html__responsive html__unpinned-leftnav">
    <head>
        <title>Stack Overflow - Where Developers Learn, Share, &amp; Build Careers</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196">
        <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a"> 
..........

The source code here has absolute paths. But when I request the same URL using requests-html with JS rendering:

from requests_html import HTMLSession

with HTMLSession() as session:
    page = session.get('https://stackoverflow.com')
    page.html.render()
    print(page.content)

I get the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>StackOverflow.org</title>
<script type="text/javascript" src="lib/jquery.js"></script>
<script type="text/javascript" src="lib/interface.js"></script>
<script type="text/javascript" src="lib/window.js"></script>
<link href="lib/dock.css" rel="stylesheet" type="text/css" />
<link href="lib/window.css" rel="stylesheet" type="text/css" />
<link rel="icon" type="image/gif" href="favicon.gif"/>
..........

The links here are relative paths.

How can I get the source code with absolute paths, as I do with requests, when using requests-html with JS rendering?

Mezo
  • Could you explain what you mean by source code, and show a side-by-side comparison of the difference in the responses you get from the two packages for a particular link? – Jarvis Dec 26 '20 at 17:02
  • Source code: the raw HTML of the page; the outputs to compare are in the question above. When I use `requests-html` with JS rendering to retrieve the raw HTML of the page, the `src` and `href` attributes contain relative links, whereas with `requests` I get full absolute links. – Mezo Dec 27 '20 at 04:06
  • I'm not able to replicate this in the latest versions. I see the same output in both. – Amit Singh Dec 31 '20 at 20:23

2 Answers

2

This should probably be a feature request for the requests-html developers. For now, however, we can achieve it with this hackish solution:

from requests_html import HTMLSession
from lxml import etree

with HTMLSession() as session:
    html = session.get('https://stackoverflow.com').html
    html.render()

    # iterate over all links
    for link in html.pq('a'):
        if "href" in link.attrib:
            # Make links absolute
            link.attrib["href"] = html._make_absolute(link.attrib["href"])

    # Print html with only absolute links
    print(etree.tostring(html.lxml).decode())

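We change the HTML object's underlying lxml tree by iterating over all `a` elements and making their `href` values absolute with the HTML object's private _make_absolute function. Note that this only touches `a` elements; `src` attributes and the `href` of `link` elements are left as they are.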
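The same idea can be extended to every `href` and `src` attribute, not only those on `a` elements. Below is a minimal sketch using only the standard library: `xml.etree.ElementTree` stands in for the lxml tree, and `urllib.parse.urljoin` stands in for the private `_make_absolute` helper. The function name `make_links_absolute`, the `URL_ATTRIBUTES` tuple, the sample snippet, and the base URL are all assumptions made for illustration; this has not been checked against a tree produced by requests-html.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Attributes assumed to carry URLs; extend as needed (hypothetical list).
URL_ATTRIBUTES = ("href", "src")


def make_links_absolute(root, base_url):
    """Resolve every href/src attribute in the tree against base_url, in place."""
    for element in root.iter():
        for name in URL_ATTRIBUTES:
            value = element.get(name)
            if value is not None:
                # urljoin leaves already-absolute URLs unchanged.
                element.set(name, urljoin(base_url, value))
    return root


# Well-formed sample modelled on the relative links shown in the question.
snippet = """<head>
<script type="text/javascript" src="lib/jquery.js"></script>
<link href="lib/dock.css" rel="stylesheet" type="text/css" />
<link rel="icon" type="image/gif" href="favicon.gif" />
<link rel="shortcut icon" href="https://cdn.sstatic.net/favicon.ico" />
</head>"""

root = make_links_absolute(ET.fromstring(snippet), "https://stackoverflow.com/")
print(ET.tostring(root, encoding="unicode"))
```

Using `get` and `set` rather than touching `attrib` directly keeps the loop limited to methods that both ElementTree and lxml elements provide, so adapting it to the rendered tree should mostly be a matter of passing in a different root and base URL.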

pascscha
0

The linked documentation for the module distinguishes between absolute and relative links.

Quote:

Grab a list of all links on the page, in absolute form (anchors excluded):

r.html.absolute_links

Could you try this statement?
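One caveat, as far as I understand it: `absolute_links` gives you a collection of URLs taken from the page's anchors, not a rewritten copy of the HTML, so on its own it would not change the `src` or `href` values in the printed source. The sketch below imitates that behaviour with the standard library only, to show what kind of result to expect. The `AnchorCollector` class, the sample markup, and the base URL are hypothetical, and this is not the library's actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip missing hrefs and in-page anchors such as "#top".
        if href and not href.startswith("#"):
            self.links.add(urljoin(self.base_url, href))


collector = AnchorCollector("https://stackoverflow.com/")
collector.feed(
    '<a href="/questions">Questions</a>'
    '<a href="#top">Top</a>'
    '<a href="https://example.com/x">External</a>'
    '<script src="lib/jquery.js"></script>'
)
print(sorted(collector.links))
```

The `script` tag's relative `src` is ignored here, which is why a list of absolute links alone does not answer the question of getting the full source with absolute paths.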

JustLudo