How to get size of a file from Webpage in BeautifulSoup

Question

I am using BeautifulSoup in Python.

I want to get the size of a downloadable file from a webpage. For example, this page has a link to download a txt file (by clicking on "save"). How can I get the size (in bytes) of that file, preferably without downloading it?

If there is no option in BeautifulSoup, then please suggest other options within and outside of Python.

score 5 · Accepted Answer · answered May 19 '16 at 05:06

Using the requests package, you can send a HEAD request to the URL that serves the text file and check the Content-Length header in the response:

>>> import requests
>>> url = "http://cancer.jpl.nasa.gov/fmprod/data?refIndex=0&productID=02965767-873d-11e5-a4ea-252aa26bb9af"
>>> res = requests.head(url)
>>> res.headers
{'content-length': '944', 'content-disposition': 'attachment; filename="Lab001_A_R03.txt"', 'server': 'Apache-Coyote/1.1', 'connection': 'close', 'date': 'Thu, 19 May 2016 05:04:45 GMT', 'content-type': 'text/plain; charset=UTF-8'}
>>> int(res.headers['content-length'])
944

As you can see, the size is the same as the one shown on the page.
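If requests is not available, a similar check can probably be done with only the standard library. The following is a minimal sketch, not part of the original answer: it swaps requests for urllib.request, and the helper names parse_content_length and remote_file_size are made up for illustration. It returns None when the header is missing or not a number, since a server is not required to send Content-Length (for example when it uses chunked transfer encoding).

```python
from urllib.request import Request, urlopen


def parse_content_length(headers):
    """Return the Content-Length value as an int, or None if it is
    missing or not a valid integer."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def remote_file_size(url, timeout=10):
    """Send a HEAD request and return the reported size in bytes,
    or None if the server does not report one.

    Hypothetical helper: raises urllib.error.URLError (or HTTPError)
    if the request itself fails.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        # response.headers supports case-insensitive .get()
        return parse_content_length(response.headers)
```

Calling remote_file_size(url) with the download URL from the session above should give the same number the server reports in Content-Length. Some servers reject HEAD requests or omit the header, so a None result or an HTTP error is worth handling before relying on the value.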

Oh, I didn't see that this page mentions it already. But, will use this for other pages. Thanks! — dc95, May 19 '16 at 05:11

score 3 · Answer 2 · answered May 19 '16 at 05:26

Since the page itself provides this information, you can, if you trust it, extract the size from the page's body:

import re
import requests
from bs4 import BeautifulSoup


url = 'http://edrn.jpl.nasa.gov/ecas/data/product/02965767-873d-11e5-a4ea-252aa26bb9af/1'
content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')

# Find the text node that consists of "<digits> bytes"
p = re.compile(r'^(\d+) bytes$')
el = soup.find(string=p)  # string= is the newer name for text= (bs4 4.4.0+)
size = int(p.match(el.string).group(1))

print(size)  # 944
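Pages do not always print the size in plain bytes; some show values such as "1.5 KB" or "2 MB". A rough sketch of how the regex approach might be extended is below. The names SIZE_PATTERN, UNIT_FACTORS and parse_size are hypothetical, and the multipliers assume binary units (1 KB = 1024 bytes), which may not match what a given site means.

```python
import re

# Assumption: binary units (1 KB = 1024 bytes). Some sites use 1000 instead.
UNIT_FACTORS = {
    "byte": 1,
    "bytes": 1,
    "b": 1,
    "kb": 1024,
    "mb": 1024 ** 2,
    "gb": 1024 ** 3,
}

# A number (optionally with a decimal part) followed by a unit.
SIZE_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*(bytes?|[kmg]?b)\s*$", re.IGNORECASE
)


def parse_size(text):
    """Convert text like '944 bytes' or '1.5 KB' to a whole number of
    bytes. Returns None if the text does not look like a size."""
    match = SIZE_PATTERN.match(text)
    if match is None:
        return None
    number, unit = match.groups()
    # int() truncates any fractional byte left over after scaling.
    return int(float(number) * UNIT_FACTORS[unit.lower()])
```

With BeautifulSoup, the compiled pattern could be passed to soup.find(string=SIZE_PATTERN) and the matching text handed to parse_size. Because the displayed value is usually rounded, the result is only an estimate; the Content-Length header is the more precise source when the server sends it.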

answered May 19 '16 at 05:26

Mikhail Gerasimov

Thanks! The other answer works better for me since I am also having float values in KB and MB. For others, if the value is in float, then try this: http://stackoverflow.com/questions/4703390/how-to-extract-a-floating-number-from-a-string-in-python OR http://stackoverflow.com/questions/385558/extract-float-double-value – dc95 May 19 '16 at 19:22
