6

Title says it all really, but I'm currently using a simple function with a case statement to convert human-readable file size strings into a size in bytes. It works well enough, but it's a bit unwieldy for porting into other code, so I'm curious to know if there are any widely available commands that a shell script could use instead?

Basically I want to take strings such as "100g" or "100gb" and convert them into bytes.

I'm currently doing the following:

to_bytes() {
    value=$(echo "$1" | sed 's/[^0123456789].*$//g')
    units=$(echo "$1" | sed 's/^[0123456789]*//g' | tr '[:upper:]' '[:lower:]')

    case "$units" in
        t|tb)   let 'value *= 1024 * 1024 * 1024 * 1024'    ;;
        g|gb)   let 'value *= 1024 * 1024 * 1024'   ;;
        m|mb)   let 'value *= 1024 * 1024'  ;;
        k|kb)   let 'value *= 1024' ;;
        b|'')   let 'value += 0'    ;;
        *)
                value=
                echo "Unsupported units '$units'" >&2
        ;;
    esac

    echo "$value"
}

It seems a bit overkill for something I would have thought was fairly common for scripts working with files; common enough that something might exist to do this more quickly.

If there are no widely available solutions (i.e. available on the majority of Unix and Linux flavours) then I'd still appreciate any tips for optimising the above function, as I'd like to make it smaller and easier to re-use.

Haravikk
  • You should probably use `*)` for the default case; if someone writes 10GiB, for example, the output would be 10, not the message. I'm not aware of a standard program to do the job. – Jonathan Leffler Jul 12 '13 at 13:31
  • For your second regexp `sed` will need `-re` **or** you need to escape the `+` (for one or more). At least in `GNU sed 4.2.2`. – Jite Jul 12 '13 at 13:47
  • There is such a tool; it is called `numfmt`. You may need to change to uppercase first. See the question [Convert between byte count and “human-readable” string](http://stackoverflow.com/questions/37015073/convert-between-byte-count-and-human-readable-string). – Svaberg May 28 '16 at 22:14

6 Answers

8

See `man numfmt` (part of GNU coreutils).

# numfmt --from=iec 42 512K 10M 7G 3.5T
42
524288
10485760
7516192768
3848290697216

# numfmt --to=iec 42 524288 10485760 7516192768 3848290697216
42
512K
10M
7.0G
3.5T
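
To accept the lower-case "100g" or "100gb" style from the question, the input can be normalised before it reaches numfmt. This is only a sketch: the function name `to_bytes` is borrowed from the question, it assumes GNU coreutils' `numfmt` is installed, and it treats every suffix as a power of 1024 (`--from=iec`). Suffixes such as "GiB" are not handled here.

```shell
to_bytes() {
    # Uppercase the input and drop an optional trailing "B", so that
    # "100g", "100gb" and "100G" all reach numfmt as "100G".
    # numfmt reads the number from standard input when none is given as an argument.
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | sed 's/B$//' | numfmt --from=iec
}
```

For example, `to_bytes 100gb` prints 107374182400.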
Alex Offshore
4
toBytes() {
 # GNU sed only: \L lowercases the input, then each suffix is expanded into "*1024" factors.
 echo "$1" | echo $((`sed 's/.*/\L\0/;s/t/Xg/;s/g/Xm/;s/m/Xk/;s/k/X/;s/b//;s/X/ *1024/g'`))
}
eplictical
  • Sorry I'm late to the party, but I just had to do this myself – eplictical Jun 18 '14 at 15:48
  • That is impressively compact! :D – David Gardner Oct 07 '16 at 08:59
  • This is really great, but I'm unsure what the first statement is intended to do? It doesn't seem to work on Mac. I modified it to the following pattern though which works like a charm: `s/[ ]*//g;s/[bB]$//;s/^.*[^0-9gkmt].*$//;s/t$/Xg/;s/g$/Xm/;s/m$/Xk/;s/k$/X/;s/X/*1024/g` – Haravikk Jul 28 '17 at 08:36
2

Here's something I wrote. It supports k, KB, and KiB. (It doesn't distinguish between power-of-two and power-of-ten suffixes, though, as in 1KB = 1000 bytes, 1KiB = 1024 bytes; every suffix is treated as a multiple of 1024.)

#!/bin/bash

parseSize() {(
    local SUFFIXES=('' K M G T P E Z Y)
    local MULTIPLIER=1

    shopt -s nocasematch

    for SUFFIX in "${SUFFIXES[@]}"; do
        local REGEX="^([0-9]+)(${SUFFIX}i?B?)?\$"

        if [[ $1 =~ $REGEX ]]; then
            echo $((${BASH_REMATCH[1]} * MULTIPLIER))
            return 0
        fi

        ((MULTIPLIER *= 1024))
    done

    echo "$0: invalid size \`$1'" >&2
    return 1
)}

Notes:

  • Leverages bash's =~ regex operator, which stores matches in an array named BASH_REMATCH.
  • Notice the cleverly-hidden parentheses surrounding the function body. They're there to keep shopt -s nocasematch from leaking out of the function.
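
The two notes above can be seen in isolation with a small bash-only sketch. `split_size` is a hypothetical helper, not part of the answer's function, and the final check assumes `nocasematch` is off in the calling shell, which is the default.

```shell
#!/bin/bash
# Hypothetical helper: split "10K" into its number and suffix, ignoring case.
split_size() {(
    shopt -s nocasematch            # only affects this subshell
    local regex='^([0-9]+)([kmgt]?)$'
    if [[ $1 =~ $regex ]]; then
        # BASH_REMATCH[1] is the first capture group, BASH_REMATCH[2] the second.
        echo "${BASH_REMATCH[1]} ${BASH_REMATCH[2]}"
    else
        return 1
    fi
)}

split_size 10K                      # prints "10 K"
shopt -q nocasematch || echo "nocasematch is still off in the caller"
```

Replacing the inner parentheses with a plain `{ ... }` body would leave `nocasematch` switched on for the rest of the script.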
John Kugelman
0

I don't know if this is OK:

awk 'BEGIN{b=1;k=1024;m=k*k;g=k^3;t=k^4}
/^[0-9.]+[kgmt]?b?$/&&/[kgmtb]$/{
    sub(/b$/,"")
        sub(/g/,"*"g)
        sub(/k/,"*"k)
        sub(/m/,"*"m)
        sub(/t/,"*"t)
"echo "$0"|bc"|getline r; print r; exit;}
{print "invalid input"}'
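  • This only handles single-line input. If multiple lines are needed, replace the `exit` with `next` (simply removing it would also print "invalid input" after every valid line).
  • This checks only for the pattern [kgmt] and an optional b, so e.g. kib or mib would fail. It also currently handles lower-case input only.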

e.g.:

kent$  echo "200kb"|awk 'BEGIN{b=1;k=1024;m=k*k;g=k^3;t=k^4}
/^[0-9.]+[kgmt]?b?$/&&/[kgmtb]$/{
    sub(/b$/,"")
        sub(/g/,"*"g)
        sub(/k/,"*"k)
        sub(/m/,"*"m)
        sub(/t/,"*"t)
"echo "$0"|bc"|getline r
print r; exit
}{print "invalid input"}'
204800
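
As a possible variation on the script above, the multiplication can be done inside awk itself, which avoids the call to `bc`, handles several lines of input, and accepts upper-case suffixes. This is a sketch rather than a drop-in replacement: `to_bytes_awk` is a hypothetical name, and because awk works with floating-point numbers the results are only exact up to 2^53.

```shell
to_bytes_awk() {
    awk '
    BEGIN { mult["k"] = 1024; mult["m"] = 1024^2; mult["g"] = 1024^3; mult["t"] = 1024^4 }
    {
        s = tolower($0)                      # accept "200KB" as well as "200kb"
        if (s ~ /^[0-9.]+[kmgt]?b?$/) {
            sub(/b$/, "", s)                 # drop an optional trailing "b"
            unit = substr(s, length(s))      # last character: a unit letter or a digit
            if (unit in mult)
                printf "%.0f\n", substr(s, 1, length(s) - 1) * mult[unit]
            else
                printf "%.0f\n", s           # plain byte count
        } else {
            print "invalid input"
        }
    }'
}
```

Each input line produces one output line, so `printf '200kb\n1.5G\n' | to_bytes_awk` prints 204800 followed by 1610612736.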
Kent
0

Okay, so it sounds like there's nothing built-in or widely available, which is a shame, so I've had a go at reducing the size of the function and come up with something that's only really 4 lines long, though it's a pretty complicated four lines!

I'm not sure if it's suitable as an answer to my original question as it's not really what I'd call the simplest method, but I want to put it up in case anyone thinks it's a useful solution, and it does have the advantage of being really short.

#!/bin/sh
to_bytes() {
    units=$(echo "$1" | sed 's/^[0123456789]*//' | tr '[:upper:]' '[:lower:]')
    index=$(echo "$units" | awk '{print index ("bkmgt kbgb  mbtb", $0)}')
    mod=$(echo "1024^(($index-1)%5)" | bc)
    [ "$mod" -gt 0 ] && 
        echo $(echo "$1" | sed 's/[^0123456789].*$//g')"*$mod" | bc
}

To quickly summarise how it works: it first strips the number from the given string and forces what remains to lowercase. It then uses awk to grab the index of the extension within a structured string of valid suffixes. The thing to note is that the string is arranged in multiples of five (so it would need to be widened if more extensions were added); for example, k and kb are at indices 2 and 7 respectively. The index is then reduced by one and taken modulo five, so both k and kb become 1, m and mb become 2, and so on. That result is then used as the exponent of 1024 to get the multiplier. If the extension is invalid this resolves to a value of zero, and an extension of b (or nothing) evaluates to 1. So long as mod is greater than zero, the input string is reduced to only the numeric part and multiplied by the modifier to get the end result.
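
The index arithmetic described above can be checked for the valid suffixes with a short sketch. `suffix_exponent` is a hypothetical helper that reuses the same lookup string but does the modulo step with shell arithmetic rather than `bc`.

```shell
suffix_exponent() {
    # Position of the suffix within the lookup string (1-based, as awk's index() returns it).
    index=$(echo "$1" | awk '{print index("bkmgt kbgb  mbtb", $0)}')
    # Reduce by one and take modulo five to get the power of 1024.
    echo $(( (index - 1) % 5 ))
}

for suffix in b k m g t kb mb gb tb; do
    echo "$suffix -> 1024^$(suffix_exponent "$suffix")"
done
```

Both k and kb map to exponent 1, m and mb to 2, g and gb to 3, and t and tb to 4, while b maps to 0.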

This is actually how I would probably have solved this originally if I were using a language like PHP or Java; it's just a bit of a weird one to put together in a shell script.

I'd still very much appreciate any simplifications though!

Haravikk
  • Honestly, your original function is better. The mod 5 thing is overly clever. Clarity beats brevity any day of the week. – John Kugelman Jul 14 '13 at 05:07
  • @John you're probably right. Even so I was looking at this again, as I've noticed the environment I needed it for doesn't actually have `bc`. Out of interest though, how large a value can `let` work with? Maybe I'm doing something wrong but finding guides for it is tricky since it's a builtin, but if I convert `bc` to `let` will I run out of bits on older systems? My original sample uses `let` as well so it's a concern for both. – Haravikk Aug 06 '13 at 16:27
0

Another variation, adding support for decimal values with a simpler T/G/M/K parser for outputs you might find from simpler Unix programs.

to_bytes() {
    # Numeric part (digits and an optional decimal point) and the unit suffix after it.
    value=$(echo "$1" | sed -e 's/[^0-9.]*$//')
    units=$(echo "$1" | sed -e 's/^[0-9.]*//')
    case "$units" in
        T)   value=$(bc <<< "scale=2; ($value * 1024 * 1024 * 1024 * 1024)")    ;;
        G)   value=$(bc <<< "scale=2; ($value * 1024 * 1024 * 1024)")   ;;
        M)   value=$(bc <<< "scale=2; ($value * 1024 * 1024)")  ;;
        K)   value=$(bc <<< "scale=2; ($value * 1024)") ;;
        b|'')   ;;  # already in bytes, nothing to do
        *)
                value=
                echo "Unsupported units '$units'" >&2
        ;;
    esac
    echo "$value"
}