Looking at your example page, it has three hrefs that point to a file. Sometimes you can tell a href points to a file based on its extension, but in the general case a website can do some server-side processing and then return a file, and sometimes the URL is not a file at all but points to some other page.
So, you have two things to do.
- Retrieve all anchor tags and their hrefs on the webpage. (You can use BeautifulSoup for this step; see the sketch after this list.)
- Filter the file URLs out from the HTML URLs. (This is the tricky part. You can come across static assets like .js or .css or image files etc.)
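For the first step, a minimal sketch with requests and BeautifulSoup could look like this (the page URL below is just a placeholder, not your actual page):

import requests
from bs4 import BeautifulSoup

page_url = 'https://example.com/downloads'  # placeholder, replace with your page

html = requests.get(page_url).text
soup = BeautifulSoup(html, 'html.parser')

# Collect the href of every anchor tag that actually has one
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)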
To perform the second part, you can use the Python requests library to get the content type. Here is a small example:
In [3]: import requests
In [4]: response = requests.head('https://speed.hetzner.de/100MB.bin', allow_redirects=True)
In [5]: response
Out[5]: <Response [200]>
In [6]: response.content
Out[6]: b''
In [7]: response.headers
Out[7]: {'Server': 'nginx', 'Date': 'Tue, 07 May 2019 21:21:28 GMT', 'Content-Type': 'application/octet-stream', 'Content-Length': '104857600', 'Last-Modified': 'Tue, 08 Oct 2013 11:48:13 GMT', 'Connection': 'keep-alive', 'ETag': '"5253f0fd-6400000"', 'Strict-Transport-Security': 'max-age=15768000; includeSubDomains', 'Accept-Ranges': 'bytes'}
If you look at response.headers, you can see that 'Content-Type' is set to 'application/octet-stream'. This field is what you should use to filter out files. There are other content types you have to look for as well, in order to decide whether a URL is downloadable or not. Once you have filtered the URLs this way, you have the list of downloadable files on the webpage.
Notice that I am using requests.head to get the content type. A HEAD request only fetches the metadata about a URL; if you do a GET/POST, the server may start sending the whole file and the request might time out.
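Putting both steps together, a rough sketch might look like the following. The set of content types treated as downloadable is only an illustrative guess; adjust it to whichever file types you care about.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Illustrative, incomplete set of content types treated as downloadable files
FILE_CONTENT_TYPES = {
    'application/octet-stream',
    'application/pdf',
    'application/zip',
}

def downloadable_links(page_url):
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, 'html.parser')
    downloadable = []
    for a in soup.find_all('a', href=True):
        url = urljoin(page_url, a['href'])  # resolve relative hrefs against the page URL
        try:
            # HEAD request: fetch only the headers, not the file itself
            head = requests.head(url, allow_redirects=True, timeout=10)
        except requests.RequestException:
            continue
        # Strip any parameters like '; charset=utf-8' before comparing
        content_type = head.headers.get('Content-Type', '').split(';')[0].strip()
        if content_type in FILE_CONTENT_TYPES:
            downloadable.append(url)
    return downloadable

print(downloadable_links('https://example.com/downloads'))  # placeholder URL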