Looking at your example page, it has three hrefs that point to a file. Sometimes you can tell a href points to a file based on its extension, but in the general case a website can do some server-side processing and then return a file, and sometimes the URL is not a file at all but points to some other page.
So, you have two things to do.
- Retrieve all anchor tags and their hrefs on the webpage. (You can use BeautifulSoup for this step; see the sketch after this list.)
- Filter the file URLs out from the HTML URLs. (This is the tricky part. You can come across static assets like .js or .css or image files etc.)
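For the first step, a minimal sketch with requests and BeautifulSoup could look like this (the page URL below is just a placeholder, not your actual page):

import requests
from bs4 import BeautifulSoup

page_url = 'https://example.com/downloads'  # placeholder, replace with your page

html = requests.get(page_url).text
soup = BeautifulSoup(html, 'html.parser')

# Collect the href of every anchor tag that actually has one
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)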
To perform the second part, you can use the Python requests library to get the content type. Here is a small example:
In [3]: import requests
In [4]: response = requests.head('https://speed.hetzner.de/100MB.bin', allow_redirects=True)
In [5]: response
Out[5]: <Response [200]>
In [6]: response.content
Out[6]: b''
In [7]: response.headers
Out[7]: {'Server': 'nginx', 'Date': 'Tue, 07 May 2019 21:21:28 GMT', 'Content-Type': 'application/octet-stream', 'Content-Length': '104857600', 'Last-Modified': 'Tue, 08 Oct 2013 11:48:13 GMT', 'Connection': 'keep-alive', 'ETag': '"5253f0fd-6400000"', 'Strict-Transport-Security': 'max-age=15768000; includeSubDomains', 'Accept-Ranges': 'bytes'}
If you look at response.headers, you can see that 'Content-Type' is set to 'application/octet-stream'. This field is what you should use to filter out files. There are other content types you have to look for as well, in order to decide whether a URL is downloadable or not. Once you have filtered the URLs this way, you have the list of downloadable files on the webpage.
Notice that I am using requests.head to get the content type. A HEAD request only fetches the metadata about a URL; if you do a GET/POST, the server may start sending the whole file and the request might time out.
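Putting both steps together, a rough sketch might look like the following. The set of content types treated as downloadable is only an illustrative guess; adjust it to whichever file types you care about.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Illustrative, incomplete set of content types treated as downloadable files
FILE_CONTENT_TYPES = {
    'application/octet-stream',
    'application/pdf',
    'application/zip',
}

def downloadable_links(page_url):
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, 'html.parser')
    downloadable = []
    for a in soup.find_all('a', href=True):
        url = urljoin(page_url, a['href'])  # resolve relative hrefs against the page URL
        try:
            # HEAD request: fetch only the headers, not the file itself
            head = requests.head(url, allow_redirects=True, timeout=10)
        except requests.RequestException:
            continue
        # Strip any parameters like '; charset=utf-8' before comparing
        content_type = head.headers.get('Content-Type', '').split(';')[0].strip()
        if content_type in FILE_CONTENT_TYPES:
            downloadable.append(url)
    return downloadable

print(downloadable_links('https://example.com/downloads'))  # placeholder URL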