I am using BeautifulSoup in Python.

I want to get the size of a downloadable file from a webpage. For example, this page has a link to download a .txt file (by clicking "save"). How can I get the size (in bytes) of that file, preferably without downloading it?

If BeautifulSoup cannot do this, please suggest other options, in Python or otherwise.

dc95

2 Answers

Using the requests package, you can send a HEAD request to the URL that serves the text file and check the Content-Length header in the response:

>>> import requests
>>> url = "http://cancer.jpl.nasa.gov/fmprod/data?refIndex=0&productID=02965767-873d-11e5-a4ea-252aa26bb9af"
>>> res = requests.head(url)
>>> res.headers
{'content-length': '944', 'content-disposition': 'attachment; filename="Lab001_A_R03.txt"', 'server': 'Apache-Coyote/1.1', 'connection': 'close', 'date': 'Thu, 19 May 2016 05:04:45 GMT', 'content-type': 'text/plain; charset=UTF-8'}
>>> int(res.headers['content-length'])
944

As you can see, the size is the same as the one shown on the page.
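
Not every server sends Content-Length in response to a HEAD request, so it may help to handle a missing header. Below is a minimal sketch of the same check wrapped in a function. It assumes the standard library's urllib.request in place of requests (so it runs without extra packages), and the names size_from_headers and remote_file_size are hypothetical, chosen only for illustration.

```python
import urllib.request


def size_from_headers(headers):
    """Return Content-Length as an int, or None if it is missing or not numeric."""
    value = headers.get("Content-Length")
    try:
        return int(value)
    except (TypeError, ValueError):
        # TypeError: header absent (value is None); ValueError: not a number.
        return None


def remote_file_size(url, timeout=10):
    """Send a HEAD request to url and read the size from the response headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # response.headers looks up header names case-insensitively.
        return size_from_headers(response.headers)
```

Calling remote_file_size(url) with the URL above should give the same number as res.headers['content-length'], provided the server still reports it. A result of None means the size has to be found another way.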

AKS

Since the page itself shows this information, you can extract it from the page's body, if you trust it:

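import re
import requests
from bs4 import BeautifulSoup


url = 'http://edrn.jpl.nasa.gov/ecas/data/product/02965767-873d-11e5-a4ea-252aa26bb9af/1'
content = requests.get(url).text
soup = BeautifulSoup(content, 'lxml')  # requires the lxml package

# Find the text node that consists solely of "<digits> bytes".
p = re.compile(r'^(\d+) bytes$')
el = soup.find(string=p)  # string= replaces the older text= argument (bs4 4.4+)
size = int(p.match(el).group(1))  # el is a str subclass, so it can be matched directly

print(size)  # 944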
Mikhail Gerasimov
  • Thanks! The other answer works better for me since I also have float values in KB and MB. For others, if the value is a float, try this: http://stackoverflow.com/questions/4703390/how-to-extract-a-floating-number-from-a-string-in-python OR http://stackoverflow.com/questions/385558/extract-float-double-value – dc95 May 19 '16 at 19:22
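
For the case raised in the comment, where the page shows sizes such as "1.5 KB" rather than a plain byte count, here is a rough sketch of converting the scraped text to bytes. The function name size_to_bytes and the unit table are hypothetical, and the code assumes binary multiples (1 KB = 1024 bytes); if the site uses decimal units, the factors would need to be powers of 1000 instead.

```python
import re

# Assumption: binary multiples (1 KB = 1024 bytes). Adjust if the site uses 1000.
_UNITS = {
    "byte": 1,
    "bytes": 1,
    "b": 1,
    "kb": 1024,
    "mb": 1024 ** 2,
    "gb": 1024 ** 3,
}

# An integer or decimal number, optional whitespace, then a unit word.
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def size_to_bytes(text):
    """Convert a string such as '944 bytes' or '1.5 KB' to a whole number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    factor = _UNITS.get(unit.lower())
    if factor is None:
        raise ValueError(f"unknown unit: {unit!r}")
    # The displayed value is rounded, so the result is only approximate
    # for anything other than plain byte counts.
    return round(float(number) * factor)
```

Because the page rounds the displayed figure, this only approximates the real size; the Content-Length header remains the more precise source when the server provides it.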