
I'm looking for a way to list all the files a webpage loads, using the requests module, like Chrome's Inspector Network tab does: there you can see every kind of file the page has loaded.


The problem is that the file I want to fetch (a .pdf in this case) isn't referenced by any specific tag; the page loads it via JavaScript and AJAX, I guess, because even after the page has loaded completely I can't find a tag containing a link to the .pdf file or anything like that. So every time I have to go to the Network tab, reload the page, and find the file in the list of loaded resources. Is there any way to capture all the loaded files and list them using the requests module?

soroushamdg
  • I think you need something like [selenium webdriver](https://selenium-python.readthedocs.io/) for dynamic content. – AlexNe Oct 10 '20 at 10:02
  • I've been using Selenium, but I don't think it solves this. How can I access the links from Python code? Aren't there some limitations too? – soroushamdg Oct 10 '20 at 12:22
  • You could load up the website in Chrome, find the bit of JavaScript that issues the GET request for the PDF, and replicate that in Python. But I doubt that's feasible, especially since the website includes some sort of token with the request. – AlexNe Oct 10 '20 at 12:26
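
A rough sketch of what that last comment describes: copy the request the page makes from the Network tab and replay it with requests. The URL and token below are placeholders, not values from the actual site.

```python
import requests

# Placeholder values: copy the real URL and headers from the matching
# request in Chrome's Network tab (right-click -> Copy -> Copy as cURL helps).
pdf_url = "https://example.com/documents/1234.pdf"
headers = {
    "User-Agent": "Mozilla/5.0",
    "Authorization": "Bearer <token copied from the browser>",
}

response = requests.get(pdf_url, headers=headers)
response.raise_for_status()

# Save the fetched PDF to disk.
with open("document.pdf", "wb") as f:
    f.write(response.content)
```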

1 Answer


When a browser loads an HTML file, it interprets the contents of that file. It may discover a `<script>` tag referencing an external JavaScript URL, so it issues a GET request to retrieve that file. When that file is received, the browser interprets the JavaScript by executing the code within; that code might contain AJAX calls that in turn fetch more files. Or the HTML may reference an external CSS file with a `<link>` tag, or an image file with an `<img>` tag. These files are also loaded by the browser and can be seen when you open the browser's inspector.

In contrast, when you issue a GET request with the requests module for a particular URL, only that one page is fetched. There is no logic to interpret the contents of the returned page and fetch the images, style sheets, JavaScript files, etc. that are referenced within it.
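
As a minimal sketch of what requests alone gives you (assuming BeautifulSoup is installed, and with a placeholder URL), you can fetch the single page and list the resources referenced statically in its HTML; anything loaded later by JavaScript/AJAX will simply not appear in this list:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "https://example.com/some-page"  # placeholder URL
html = requests.get(page_url).text

soup = BeautifulSoup(html, "html.parser")

# Collect only the URLs referenced directly in the HTML source.
resources = set()
for tag in soup.find_all(["script", "img"], src=True):
    resources.add(urljoin(page_url, tag["src"]))
for tag in soup.find_all("link", href=True):
    resources.add(urljoin(page_url, tag["href"]))

for url in sorted(resources):
    print(url)
```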

You can, however, use Python to automate a browser with a tool such as Selenium WebDriver, which will load the page fully, including the resources fetched by JavaScript.
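
For example, one way to get something close to the Network tab from Python (a sketch assuming Chrome, a matching chromedriver, and Selenium 4; the URL is a placeholder) is to enable Chrome's performance log and read the network events it records:

```python
import json
import time
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chrome to record DevTools performance events, which include network activity.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(options=options)
driver.get("https://example.com/page-that-loads-the-pdf")  # placeholder URL
time.sleep(5)  # crude wait so AJAX-loaded resources have time to arrive

# Each log entry wraps a DevTools event as JSON; completed responses show up
# as "Network.responseReceived" events carrying the requested URL.
for entry in driver.get_log("performance"):
    event = json.loads(entry["message"])["message"]
    if event["method"] == "Network.responseReceived":
        url = event["params"]["response"]["url"]
        print(url)                      # every loaded resource
        if url.lower().endswith(".pdf"):
            print("PDF found:", url)    # the file you were after

driver.quit()
```

Once you have the PDF's URL this way, you can usually download it with requests, possibly passing along the same session cookies obtained from `driver.get_cookies()`.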

Booboo
  • So if I use Selenium WebDriver, how can I access the links from Python code? I've considered it as a solution too, but aren't there some limitations? – soroushamdg Oct 10 '20 at 11:56
  • Well, first I would suggest that you read the Selenium documentation. But if you are trying to open the Chrome inspector, for example, you can add the following startup option: `--auto-open-devtools-for-tabs`. See [How to open Chrome browser console through Selenium?](https://stackoverflow.com/questions/54589156/how-to-open-chrome-browser-console-through-selenium?rq=1) – Booboo Oct 10 '20 at 13:04
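
A minimal sketch of adding that startup option (the flag comes from the linked answer; the rest is ordinary Selenium setup with a placeholder URL):

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--auto-open-devtools-for-tabs")  # open DevTools for each tab
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder URL
```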