Better way to convert file sizes in Python

Question

I am using a library that reads a file and returns its size in bytes.

This file size is then displayed to the end user; to make it easier for them to understand, I explicitly convert the file size to MB by dividing it by 1024.0 * 1024.0. Of course this works, but I am wondering whether there is a better way to do this in Python.

By better, I mean perhaps a stdlib function that can manipulate sizes according to the type I want. For example, if I specify MB, it automatically divides by 1024.0 * 1024.0. Something along these lines.

So write one. Also note that many systems now use MB to mean 10^6 instead of 2^20. — tc., Mar 04 '11 at 13:08
@A A, @tc: Please keep in mind that the SI and IEC Norm is `kB (Kilo) for 1.000 Byte` and `KiB (Kibi) for 1.024 Byte`. See http://en.wikipedia.org/wiki/Kibibyte . — Bobby, Mar 04 '11 at 13:12
@Bobby: kB actually means "kilobel", equal to 10000 dB. There is no SI unit for byte. IIRC, the IEC recommends KiB but does not define kB or KB. — tc., Mar 12 '11 at 04:41
@tc. The prefix kilo is defined by SI to mean 1000. The IEC defined kB, etc. to use the SI prefix instead of 2^10. — ford, Feb 11 '13 at 23:03
@fizzisist: Cite? The IEC has established KiB/MiB/etc, but to my knowledge there's no international standard specifying kB/MB/etc apart from SI, where kB/MB mean kilobel/megabel (just as dB means decibel). It would be unwise, anyway, since MB/GB has long been ambiguous. — tc., Feb 18 '13 at 16:45
I mean the prefixes are defined generally by SI, but the abbreviations for data size are not: http://physics.nist.gov/cuu/Units/prefixes.html. Those are defined by IEC: http://physics.nist.gov/cuu/Units/binary.html — ford, Feb 19 '13 at 16:40
possible duplicate of [Reusable library to get human readable version of file size?](http://stackoverflow.com/questions/1094841/reusable-library-to-get-human-readable-version-of-file-size) — Martin Thoma, Apr 27 '15 at 15:19
@tc. SI clearly states "SI prefixes refer strictly to powers of 10, and should not be used for powers of 2. For example, 1 kilobit should not be used to represent 1024 bits" — endolith, Jan 17 '17 at 22:53
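
To make the distinction debated in these comments concrete, here is a minimal sketch that expresses the same byte count in decimal megabytes (10^6) and in binary mebibytes (2^20). The names `MB`, `MIB` and `describe` are illustrative assumptions, not taken from any library.

```python
MB = 10 ** 6    # decimal megabyte (SI prefix)
MIB = 2 ** 20   # binary mebibyte (IEC prefix)

def describe(size_bytes):
    """Return (size in MB, size in MiB), each rounded to two decimals."""
    return (round(size_bytes / MB, 2), round(size_bytes / MIB, 2))

print(describe(5 * MIB))  # (5.24, 5.0)
```

The roughly 5% gap at this scale is why it matters which convention is shown to the end user.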

score 252 · Answer 1 · edited Apr 22 '17 at 19:40

Here is what I use:

import math

def convert_size(size_bytes):
   if size_bytes == 0:
       return "0B"
   size_name = ("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
   i = int(math.floor(math.log(size_bytes, 1024)))
   p = math.pow(1024, i)
   s = round(size_bytes / p, 2)
   return "%s %s" % (s, size_name[i])

Note: the size must be given in bytes.
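
A possible variant, offered as a sketch rather than a drop-in replacement: for integer byte counts, the unit index can be derived from `int.bit_length()` instead of a floating-point logarithm, which avoids any rounding concerns near exact powers of 1024. The function name `convert_size_int` and the choice of IEC labels (KiB, MiB, ...) are my assumptions, not part of the answer above.

```python
def convert_size_int(size_bytes):
    """Format a non-negative integer byte count using binary (IEC) units."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    units = ("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB")
    if size_bytes == 0:
        return "0 B"
    # bit_length() - 1 is floor(log2(n)); each unit step is 10 bits (2**10 == 1024).
    i = min((size_bytes.bit_length() - 1) // 10, len(units) - 1)
    value = round(size_bytes / (1 << (10 * i)), 2)
    return "%s %s" % (value, units[i])

print(convert_size_int(1536))     # 1.5 KiB
print(convert_size_int(1048576))  # 1.0 MiB
```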

edited Apr 22 '17 at 19:40

vallentin

answered Feb 11 '13 at 22:34

James Sapam

If you're sending size in bytes then just add "B" as the first element of size_name. – tuxGurl Mar 25 '13 at 17:13
When you have 0 sized byte of file, it fails. log(0, 1024) is not defined! You should check 0 byte case before this statement i = int(math.floor(math.log(size,1024))). – genclik27 May 07 '14 at 13:57
genclik - you're right. I've just submitted a minor edit which will fix this, and enable conversion from bytes. Thanks, Sapam, for the original – FarmerGedden Aug 22 '14 at 09:37
HI @WHK as tuxGurl mentioned its an easy fix. – James Sapam Jan 25 '16 at 02:03
Actually the size names would need to be ("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"). See https://en.wikipedia.org/wiki/Mebibyte for more information. – Alex Dec 06 '19 at 15:08
Why do you use math.log() with 1024? Isn't 1024 bytes == 1 kilobyte? – Ryan Glenn Aug 06 '22 at 21:08

Lennart Regebro · Accepted Answer · 2011-03-04T13:42:00.623

145

There is hurry.filesize, which takes the size in bytes and makes a nice string out of it.

>>> from hurry.filesize import size
>>> size(11000)
'10K'
>>> size(198283722)
'189M'

Or if you want 1K == 1000 (which is what most users assume):

>>> from hurry.filesize import size, si
>>> size(11000, system=si)
'11K'
>>> size(198283722, system=si)
'198M'

It has IEC support as well (but that wasn't documented):

>>> from hurry.filesize import size, iec
>>> size(11000, system=iec)
'10Ki'
>>> size(198283722, system=iec)
'189Mi'

Because it's written by the Awesome Martijn Faassen, the code is small, clear and extensible. Writing your own systems is dead easy.

Here is one:

mysystem = [
    (1024 ** 5, ' Megamanys'),
    (1024 ** 4, ' Lotses'),
    (1024 ** 3, ' Tons'), 
    (1024 ** 2, ' Heaps'), 
    (1024 ** 1, ' Bunches'),
    (1024 ** 0, ' Thingies'),
    ]

Used like so:

>>> from hurry.filesize import size
>>> size(11000, system=mysystem)
'10 Bunches'
>>> size(198283722, system=mysystem)
'189 Heaps'

edited Mar 04 '11 at 13:42

answered Mar 04 '11 at 13:30

Lennart Regebro

Perfect! More than wanting to make it work for my case, I wanted to know if there was something like this. – user225312 Mar 04 '11 at 13:45
Hm, now I need one to go the other way. From "1 kb" to `1024` (an int). – mlissner Jul 09 '18 at 19:06
Works only in python 2 – e-info128 Feb 18 '19 at 17:15
This package might be cool but the odd license and the fact that there is no online available source code make it something I'd be very happy to avoid. And also it seems to support python2 only. – Almog Cohen Apr 02 '19 at 23:42
@AlmogCohen the source is online, available straight from PyPI (some packages do not have a Github repository, just a PyPI page) and the license is not that obscure, ZPL is the Zope Public License, which is, to the best of my knowledge, BSD-like. I do agree that the licensing itself is odd: there is no standard 'LICENSE.txt' file, nor is there a preamble at the top of each source file. – sleblanc May 20 '19 at 18:46
Is it possible to get all conversion in `MB` using `hurry.filesize`? @Lennart Regebro – alper Mar 02 '20 at 17:28
@alper Yes. You can set up whatever system you want. If you ONLY want MB, you don't need hurry.filesize, then just divide it with a megabyte. – Lennart Regebro Mar 09 '20 at 14:10
In order to get megabyte I did following equation using bitwise shifting operator: `MBFACTOR = float(1 << 20); mb= int(size_in_bytes) / MBFACTOR` @LennartRegebro – alper Mar 09 '20 at 14:26
I found this to be not so accurate. for 1.94 GB the size() only provided 1G, no matter I used verbose or not. only after using @James Idea, I got the right answer – soBusted Nov 07 '21 at 20:03
Yeah, in effect it rounds down. – Lennart Regebro Nov 11 '21 at 12:58
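
The rounding-down effect discussed in the last two comments can be reproduced with plain integer division. This only illustrates the effect and is not hurry.filesize's actual code; the helper name `whole_units` is hypothetical.

```python
GIB = 1024 ** 3

def whole_units(size_bytes, factor):
    """Number of whole units in size_bytes, discarding any remainder."""
    return size_bytes // factor

size = int(1.94 * GIB)          # roughly 1.94 GB, as in the comment above
print(whole_units(size, GIB))   # 1
print(round(size / GIB, 2))     # 1.94
```

If the fractional part matters, dividing with `/` and rounding explicitly keeps it.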

ccpizza · Answer 3 · 2019-08-16T14:28:32.897

Instead of a size divisor of 1024 * 1024 you could use the << bitwise shifting operator, i.e. 1<<20 to get megabytes, 1<<30 to get gigabytes, etc.

In the simplest scenario you can have e.g. a constant MBFACTOR = float(1<<20) which can then be used with bytes, i.e.: megas = size_in_bytes/MBFACTOR.

Megabytes are usually all that you need, or otherwise something like this can be used:

# bytes pretty-printing
UNITS_MAPPING = [
    (1<<50, ' PB'),
    (1<<40, ' TB'),
    (1<<30, ' GB'),
    (1<<20, ' MB'),
    (1<<10, ' KB'),
    (1, (' byte', ' bytes')),
]


def pretty_size(bytes, units=UNITS_MAPPING):
    """Get human-readable file sizes.
    simplified version of https://pypi.python.org/pypi/hurry.filesize/
    """
    for factor, suffix in units:
        if bytes >= factor:
            break
    amount = int(bytes / factor)

    if isinstance(suffix, tuple):
        singular, multiple = suffix
        if amount == 1:
            suffix = singular
        else:
            suffix = multiple
    return str(amount) + suffix

print(pretty_size(1))
print(pretty_size(42))
print(pretty_size(4096))
print(pretty_size(238048577))
print(pretty_size(334073741824))
print(pretty_size(96995116277763))
print(pretty_size(3125899904842624))

## [Out] ###########################
1 byte
42 bytes
4 KB
227 MB
311 GB
88 TB
2 PB

@Tjorriemorrie: it must be a left shift, right shifting will drop the only bit off and will result in `0`. — ccpizza, Apr 06 '18 at 13:47
i know this is old, but would this be correct usage? def convert_to_mb(data_b): print(data_b/(1 << 20)) — roastbeeef, Mar 27 '19 at 12:25
@roastbeef yes that's correct. And I like this answer for this purpose, I also had to change bytes to megabytes only. — Matthias, Dec 06 '22 at 08:58

Peter F · Answer 4 · 2022-10-09T06:44:27.980

Here are some easy-to-copy one-liners to use if you already know what unit size you want. If you're looking for a more generic function with a few nice options, see my FEB 2021 update further on...

Bytes

import os  # needed by every one-liner below; filepath is the path to your file

print(f"{os.path.getsize(filepath):,} B")

Kilobits

print(f"{os.path.getsize(filepath)/(1<<7):,.0f} Kb")

Kilobytes

print(f"{os.path.getsize(filepath)/(1<<10):,.0f} KB")

Megabits

print(f"{os.path.getsize(filepath)/(1<<17):,.0f} Mb")

Megabytes

print(f"{os.path.getsize(filepath)/(1<<20):,.0f} MB")

Gigabits

print(f"{os.path.getsize(filepath)/(1<<27):,.0f} Gb")

Gigabytes

print(f"{os.path.getsize(filepath)/(1<<30):,.0f} GB")

Terabytes

print(f"{os.path.getsize(filepath)/(1<<40):,.0f} TB")

UPDATE FEB 2021 Here are my updated and fleshed-out functions to a) get file/folder size, b) convert into desired units:

from pathlib import Path

def get_path_size(path = Path('.'), recursive=False):
    """
    Gets file size, or total directory size

    Parameters
    ----------
    path: str | pathlib.Path
        File path or directory/folder path

    recursive: bool
        True -> use .rglob i.e. include nested files and directories
        False -> use .glob i.e. only process current directory/folder

    Returns
    -------
    int:
        File size or recursive directory size in bytes
        Use cleverutils.format_bytes to convert to other units e.g. MB
    """
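    path = Path(path)
    if path.is_file():
        return path.stat().st_size
    if path.is_dir():
        # '*' also matches files without an extension; directories themselves are skipped
        path_glob = path.rglob('*') if recursive else path.glob('*')
        return sum(file.stat().st_size for file in path_glob if file.is_file())
    raise FileNotFoundError(f"{path} is not a file or directory")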


def format_bytes(bytes, unit, SI=False):
    """
    Converts bytes to common units such as kb, kib, KB, mb, mib, MB

    Parameters
    ---------
    bytes: int
        Number of bytes to be converted

    unit: str
        Desired unit of measure for output


    SI: bool
        True -> Use SI standard e.g. KB = 1000 bytes
        False -> Use JEDEC standard e.g. KB = 1024 bytes

    Returns
    -------
    str:
        E.g. "7 MiB" where MiB is the original unit abbreviation supplied
    """
    if unit.lower() in "b bit bits".split():
        return f"{bytes*8} {unit}"
    unitN = unit[0].upper()+unit[1:].replace("s","")  # Normalised
    reference = {"Kb Kib Kibibit Kilobit": (7, 1),
                 "KB KiB Kibibyte Kilobyte": (10, 1),
                 "Mb Mib Mebibit Megabit": (17, 2),
                 "MB MiB Mebibyte Megabyte": (20, 2),
                 "Gb Gib Gibibit Gigabit": (27, 3),
                 "GB GiB Gibibyte Gigabyte": (30, 3),
                 "Tb Tib Tebibit Terabit": (37, 4),
                 "TB TiB Tebibyte Terabyte": (40, 4),
                 "Pb Pib Pebibit Petabit": (47, 5),
                 "PB PiB Pebibyte Petabyte": (50, 5),
                 "Eb Eib Exbibit Exabit": (57, 6),
                 "EB EiB Exbibyte Exabyte": (60, 6),
                 "Zb Zib Zebibit Zettabit": (67, 7),
                 "ZB ZiB Zebibyte Zettabyte": (70, 7),
                 "Yb Yib Yobibit Yottabit": (77, 8),
                 "YB YiB Yobibyte Yottabyte": (80, 8),
                 }
    key_list = '\n'.join(["     b Bit"] + [x for x in reference.keys()]) +"\n"
    if unitN not in key_list:
        raise IndexError(f"\n\nConversion unit must be one of:\n\n{key_list}")
    units, divisors = [(k,v) for k,v in reference.items() if unitN in k][0]
    if SI:
        divisor = 1000**divisors[1]/8 if "bit" in units else 1000**divisors[1]
    else:
        divisor = float(1 << divisors[0])
    value = bytes / divisor
    return f"{value:,.0f} {unitN}{(value != 1 and len(unitN) > 3)*'s'}"


# Tests 
>>> assert format_bytes(1,"b") == '8 b'
>>> assert format_bytes(1,"bits") == '8 bits'
>>> assert format_bytes(1024, "kilobyte") == "1 Kilobyte"
>>> assert format_bytes(1024, "kB") == "1 KB"
>>> assert format_bytes(7141000, "mb") == '54 Mb'
>>> assert format_bytes(7141000, "mib") == '54 Mib'
>>> assert format_bytes(7141000, "Mb") == '54 Mb'
>>> assert format_bytes(7141000, "MB") == '7 MB'
>>> assert format_bytes(7141000, "mebibytes") == '7 Mebibytes'
>>> assert format_bytes(7141000, "gb") == '0 Gb'
>>> assert format_bytes(1000000, "kB") == '977 KB'
>>> assert format_bytes(1000000, "kB", SI=True) == '1,000 KB'
>>> assert format_bytes(1000000, "kb") == '7,812 Kb'
>>> assert format_bytes(1000000, "kb", SI=True) == '8,000 Kb'
>>> assert format_bytes(125000, "kb") == '977 Kb'
>>> assert format_bytes(125000, "kb", SI=True) == '1,000 Kb'
>>> assert format_bytes(125*1024, "kb") == '1,000 Kb'
>>> assert format_bytes(125*1024, "kb", SI=True) == '1,024 Kb'

UPDATE OCT 2022

My answer to a recent comment was too long, so here's some further explanation of the 1<<20 magic! I also notice that float isn't needed so I've removed that from the examples above.

As stated in another answer above, "<<" is the bitwise left-shift operator. It takes the binary representation of the left-hand operand and moves its digits to the left, here by 20 places. When we count in decimal, the number of digits tells us whether we have reached the tens, hundreds, thousands, millions and so on. Binary works the same way, except that for a byte count every additional 10 binary digits takes us from bytes to kilobytes, then megabytes and so on. So 1<<20 is binary 1 followed by 20 binary zeros or, if you remember how to convert from binary to decimal, 2 to the power of 20 (2**20), which equals 1048576. In the snippets above, os.path.getsize returns a value in bytes, and 1048576 bytes are strictly speaking a mebibyte (MiB) and casually speaking a megabyte (MB).
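
The equivalences described above are easy to check in an interpreter. A small sketch (the variable names are mine):

```python
# Shifting 1 left by n bits is the same as multiplying it by 2**n.
assert 1 << 10 == 2 ** 10 == 1024          # kibibyte
assert 1 << 20 == 2 ** 20 == 1048576       # mebibyte
assert 1 << 30 == 2 ** 30 == 1073741824    # gibibyte

# The binary form of 1 << 20 is a single 1 followed by 20 zeros.
binary = bin(1 << 20)
assert binary == "0b1" + "0" * 20

# Dividing a byte count by 1 << 20 converts it to mebibytes.
size_in_bytes = 5 * (1 << 20)
print(size_in_bytes / (1 << 20))  # 5.0
```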

That's a pretty clever way to do that. I wonder if you could put these into a function where you pass in whether you want kb's. mb's and so-on. You could even have an input command that asks which one you want, which would be pretty convenient if you do this a lot. — Hildy, Oct 07 '18 at 02:09
See above, Hildy... You can also customise the dictionary line like @lennart-regebro outlined above... which could be useful for storage management e.g. "Partition", "Cluster", "4TB Disks", "DVD_RW", "Blu-Ray Disk", "1GB memory sticks" or whatever. — Peter F, Oct 07 '18 at 05:41
I've also just added Kb (Kilobit), Mb (Megabit), and Gb (Gigabit) - users often get those confused in terms of network or file-transfer speeds, so thought it might be handy. — Peter F, Oct 07 '18 at 06:01
I love the one-liners, consider condensing with f-strings, e.g.: `f'{os.path.getsize(filepath)/float(1<<20):.0f} MB'` — pan0ramic, Oct 31 '21 at 19:42
I've updated the post and answered your question at the end Mykola. Hope that helps? — Peter F, Oct 09 '22 at 06:40

score 29 · Answer 5 · edited Dec 08 '15 at 18:53

Here is a compact function to calculate the size:

def GetHumanReadable(size,precision=2):
    suffixes=['B','KB','MB','GB','TB']
    suffixIndex = 0
    while size >= 1024 and suffixIndex < len(suffixes) - 1:
        suffixIndex += 1 #increment the index of the suffix
        size = size/1024.0 #apply the division
    return "%.*f%s"%(precision,size,suffixes[suffixIndex])

For more detailed output and vice versa operation please refer: http://code.activestate.com/recipes/578019-bytes-to-human-human-to-bytes-converter/

edited Dec 08 '15 at 18:53

Community

answered Aug 14 '15 at 12:00

Pavan Gupta

The while statement should be changed to `while size >= 1024 and index < len(suffixes):`, otherwise the function would return `1024.0KB ` instead of `1.0MB` for example. – AnythingIsFine May 14 '21 at 13:30

rhoitjadhav · Answer 6 · 2021-04-02T06:42:46.287

22

Here it is:

def convert_bytes(size):
    for x in ['bytes', 'KB', 'MB', 'GB', 'TB']:
        if size < 1024.0:
            return "%3.1f %s" % (size, x)
        size /= 1024.0

    # Reached only for sizes of 1024 TB or more; size is now expressed in PB
    return "%3.1f %s" % (size, 'PB')

Output

>>> convert_bytes(1024)
'1.0 KB'
>>> convert_bytes(102400)
'100.0 KB'

edited Apr 02 '21 at 06:42

answered Dec 04 '19 at 11:10

rhoitjadhav

That's MiB, not MB, and so on ... – Bouncner Apr 26 '21 at 08:56

score 12 · Answer 7 · answered Oct 31 '16 at 17:35

Just in case anyone's searching for the reverse of this problem (as I sure did) here's what works for me:

def get_bytes(size, suffix):
    size = int(float(size))
    suffix = suffix.lower()

    if suffix == 'kb' or suffix == 'kib':
        return size << 10
    elif suffix == 'mb' or suffix == 'mib':
        return size << 20
    elif suffix == 'gb' or suffix == 'gib':
        return size << 30

    return False

Romeo Mihalcea

You are not handling the case of decimal numbers like 1.5GB. To fix it just change the `<< 10` to `* 1024`, `<< 20` to `* 1024**2` and `<< 30` to `* 1024**3`. – E235 Mar 13 '19 at 16:02
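
Following the suggestion in the comment above, here is a sketch of a variant that multiplies instead of shifting, so fractional inputs such as 1.5 GB are preserved. The name `get_bytes_decimal`, the extra "b" entry, and raising `ValueError` for an unknown suffix (rather than returning `False`) are my own choices, not part of the original answer.

```python
FACTORS = {
    "b": 1,
    "kb": 1024, "kib": 1024,
    "mb": 1024 ** 2, "mib": 1024 ** 2,
    "gb": 1024 ** 3, "gib": 1024 ** 3,
}

def get_bytes_decimal(size, suffix):
    """Convert a size with a unit suffix to a whole number of bytes."""
    try:
        factor = FACTORS[suffix.lower()]
    except KeyError:
        raise ValueError("unknown suffix: %r" % suffix)
    # Multiply before truncating so that 1.5 GB is not cut down to 1 GB.
    return int(float(size) * factor)

print(get_bytes_decimal("1.5", "GB"))  # 1610612736
```

Like the original, this treats KB, MB and GB as powers of 1024.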

score 6 · Answer 8 · answered Jan 07 '20 at 18:26

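UNITS = {1000: ['KB', 'MB', 'GB'],
         1024: ['KiB', 'MiB', 'GiB']}

def approximate_size(size, flag_1024_or_1000=True):
    # True -> binary units (1024), False -> decimal units (1000)
    mult = 1024 if flag_1024_or_1000 else 1000
    for unit in UNITS[mult]:
        size = size / mult
        if size < mult:
            break
    # Sizes beyond the largest unit are reported in that unit instead of returning None
    return '{0:.3f} {1}'.format(size, unit)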

approximate_size(2123, False)

kamran kausar

this is usable in so many settings. glad i came across this comment. thanks a lot. – Saurabh Jain May 13 '20 at 05:23
yah this is pretty sweet and does not require outside libs – chowpay Jan 27 '21 at 03:09

score 2 · Answer 9 · answered Nov 06 '17 at 16:20

Here are my two cents, which permit converting up and down and add customizable precision:

def convertFloatToDecimal(f=0.0, precision=2):
    '''
    Convert a float to string of decimal.
    precision: by default 2.
    If no arg provided, return "0.00".
    '''
    return ("%." + str(precision) + "f") % f

def formatFileSize(size, sizeIn, sizeOut, precision=0):
    '''
    Convert file size to a string representing its value in B, KB, MB and GB.
    The convention is based on sizeIn as original unit and sizeOut
    as final unit. 
    '''
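    # Normalise case once so the comparisons below match what the asserts accept
    sizeIn, sizeOut = sizeIn.upper(), sizeOut.upper()
    assert sizeIn in {"B", "KB", "MB", "GB"}, "sizeIn type error"
    assert sizeOut in {"B", "KB", "MB", "GB"}, "sizeOut type error"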
    if sizeIn == "B":
        if sizeOut == "KB":
            return convertFloatToDecimal((size/1024.0), precision)
        elif sizeOut == "MB":
            return convertFloatToDecimal((size/1024.0**2), precision)
        elif sizeOut == "GB":
            return convertFloatToDecimal((size/1024.0**3), precision)
    elif sizeIn == "KB":
        if sizeOut == "B":
            return convertFloatToDecimal((size*1024.0), precision)
        elif sizeOut == "MB":
            return convertFloatToDecimal((size/1024.0), precision)
        elif sizeOut == "GB":
            return convertFloatToDecimal((size/1024.0**2), precision)
    elif sizeIn == "MB":
        if sizeOut == "B":
            return convertFloatToDecimal((size*1024.0**2), precision)
        elif sizeOut == "KB":
            return convertFloatToDecimal((size*1024.0), precision)
        elif sizeOut == "GB":
            return convertFloatToDecimal((size/1024.0), precision)
    elif sizeIn == "GB":
        if sizeOut == "B":
            return convertFloatToDecimal((size*1024.0**3), precision)
        elif sizeOut == "KB":
            return convertFloatToDecimal((size*1024.0**2), precision)
        elif sizeOut == "MB":
            return convertFloatToDecimal((size*1024.0), precision)

Add TB, etc, as you wish.
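
The same conversions can also be expressed with a table of exponents, which makes adding TB a one-line change. This is a sketch under my own naming (`EXPONENTS`, `convert_size_units`), keeping the 1024-based convention used above; it is not the author's code.

```python
EXPONENTS = {"B": 0, "KB": 1, "MB": 2, "GB": 3, "TB": 4}

def convert_size_units(size, size_in, size_out, precision=0):
    """Convert size from size_in to size_out (1024-based) and format it."""
    size_in, size_out = size_in.upper(), size_out.upper()
    if size_in not in EXPONENTS or size_out not in EXPONENTS:
        raise ValueError("units must be one of %s" % ", ".join(EXPONENTS))
    # A positive difference converts down to smaller units (multiply),
    # a negative one converts up to larger units (divide).
    factor = 1024.0 ** (EXPONENTS[size_in] - EXPONENTS[size_out])
    return "%.*f" % (precision, size * factor)

print(convert_size_units(1, "GB", "MB", 2))     # 1024.00
print(convert_size_units(1536, "kb", "mb", 1))  # 1.5
```

Unlike the branching version, converting a unit to itself returns the formatted input rather than `None`.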

I will vote this up because it can be worked out just with the python standard library — Ciasto piekarz, Aug 07 '18 at 05:34

score 2 · Answer 10 · answered Nov 30 '20 at 02:24

I wanted two-way conversion, and I wanted to use Python 3's format() support to be as Pythonic as possible. The datasize library module may be worth trying: https://pypi.org/project/datasize/

$ pip install -qqq datasize
$ python
...
>>> from datasize import DataSize
>>> 'My new {:GB} SSD really only stores {:.2GiB} of data.'.format(DataSize('750GB'),DataSize(DataSize('750GB') * 0.8))
'My new 750GB SSD really only stores 558.79GiB of data.'

score 1 · Answer 11 · answered Dec 01 '18 at 00:31

Here's a version that matches the output of ls -lh.

def human_size(num: int) -> str:
    base = 1
    for unit in ['B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y']:
        n = num / base
        if n < 9.95 and unit != 'B':
            # Less than 10 then keep 1 decimal place
            value = "{:.1f}{}".format(n, unit)
            return value
        if round(n) < 1000:
            # Less than 4 digits so use this
            value = "{}{}".format(round(n), unit)
            return value
        base *= 1024
    value = "{}{}".format(round(n), unit)
    return value

score -1 · Answer 12 · answered May 20 '19 at 22:17

Here is my implementation:

from bisect import bisect

def to_filesize(bytes_num, si=True):
    decade = 1000 if si else 1024
    partitions = tuple(decade ** n for n in range(1, 6))
    suffixes = tuple('BKMGTP')

    i = bisect(partitions, bytes_num)
    s = suffixes[i]

    for n in range(i):
        bytes_num /= decade

    f = '{:.3f}'.format(bytes_num)

    return '{}{}'.format(f.rstrip('0').rstrip('.'), s)

It will print up to three decimals and it strips trailing zeros and periods. The boolean parameter si will toggle usage of 10-based vs. 2-based size magnitude.

This is its counterpart. It lets you write clean configuration files like {'maximum_filesize': from_filesize('10M')}. It returns an integer that approximates the intended file size. I am not using bit shifting because the source value is a floating point number (it will accept from_filesize('2.15M') just fine). Converting it to an integer/decimal would work, but it would make the code more complicated, and it already works as it is.

def from_filesize(spec, si=True):
    decade = 1000 if si else 1024
    suffixes = tuple('BKMGTP')

    num = float(spec[:-1])
    s = spec[-1]
    i = suffixes.index(s)

    for n in range(i):
        num *= decade

    return int(num)