
I know there's gdata::humanReadable(), which converts 10000 to "9.8 KiB" and so on, but what about the opposite conversion? I'm sure there should be a function for that as well, but I can't find one with a quick search.
So far I'm using my own quick-and-dirty solution:

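get_size_bytes <- function(inpstr){
  # multipliers in bytes; "KiB" is spelled the way humanReadable() prints it
  sizes <- c(kB =1000,
             KiB=2^10,
             MB =1e3^2,
             MiB=(2^10)^2)

  # split each string into its unit suffix and its numeric part
  suffix <- gsub( '[\\.0-9]+ ?',   '',   inpstr)
  number <- gsub('([\\.0-9]+) ?.*','\\1',inpstr)
  mult <- sizes[suffix]   # NA for unknown suffixes

  return(unname(as.numeric(number)*mult))
}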

# usage example:
get_size_bytes(c('100.1 MB', '9 kB', '10 kB', '9 xx'))
# [1] 100100000      9000     10000        NA
Vasily A
  • Also `utils:::format.object_size` (though not the direction you're suggesting). – r2evans Jun 18 '20 at 23:31
  • I haven't seen one going in the opposite direction. Other than potentially adding giga, tera, peta, etc, I don't see a way that is significantly different than your function here. Do you see failed corner cases or lack-of-generality in this function? – r2evans Jun 18 '20 at 23:35
  • well, there are a lot of minor things that could be added, for example correct handling of leading spaces (`" 1 MB"`), multiple spaces (`1  MB`), negative values (`-1 MB`), lowercase (`1 kb`), etc. And in general I would prefer to reuse something from CRAN rather than reinvent the wheel. – Vasily A Jun 18 '20 at 23:42
  • I understand your preference towards not reinventing this wheel, though I do not know of any. So your question is as much about resilience to malformed strings as it is about CRAN-availability, is that right? If you always want excess whitespace ignored, then just ... remove it always on input. – r2evans Jun 18 '20 at 23:47
  • If you want to do an in-string replacement, though, it might be more generically applicable: `"I have 1K apples"` --> `"I have 1000 apples"`? (Remove the assumption of `B`?) – r2evans Jun 18 '20 at 23:48
  • you're right; my main motivation was reluctance to add my own function if there's an existing one already. Now I realize there's probably not one indeed - in that case I would still keep my question posted here (I see there are votes to close it already) in case my code would be useful to someone. – Vasily A Jun 18 '20 at 23:55
  • P.S. switching from bytes to more general in-string replacement could be potentially useful but I think makes it more complicated, for now I would prefer to have it size-focused. – Vasily A Jun 18 '20 at 23:56
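
The corner cases raised in the comments above (leading or repeated whitespace, negative values, lowercase units) could be handled along these lines. This is only a sketch: `parse_size_bytes` is a hypothetical name, not an existing CRAN function, and it assumes that units are case-insensitive, so `kb`, `KB` and `kB` all mean 1000 bytes.

```r
# Hypothetical helper: a more forgiving variant of get_size_bytes().
parse_size_bytes <- function(x) {
  # Multipliers keyed by lower-cased unit (assumption: case carries no meaning).
  sizes <- c(b = 1,
             kb = 1e3, mb = 1e6, gb = 1e9, tb = 1e12,
             kib = 2^10, mib = 2^20, gib = 2^30, tib = 2^40)
  # optional sign, a number, optional whitespace, a unit; anchored to the whole string
  ptn <- "^[[:space:]]*([-+]?[0-9]*\\.?[0-9]+)[[:space:]]*([A-Za-z]+)[[:space:]]*$"
  m <- regmatches(x, regexec(ptn, x))
  vapply(m, function(p) {
    if (length(p) != 3L) return(NA_real_)            # string did not match at all
    unname(as.numeric(p[2]) * sizes[tolower(p[3])])  # NA for unknown units
  }, numeric(1), USE.NAMES = FALSE)
}

parse_size_bytes(c(" 1 MB", "1   MB", "-1 MB", "1 kb", "9 xx"))
```

Strings that do not look like "number plus unit" at all, and strings with an unknown unit, both come back as `NA`, which matches the behaviour of the original function.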

1 Answer


I think it can be generalized a little, to leave the calling function some room to deal with other things as needed. Replacing the substring in place gives some interesting power.

Here's a suggestion that will replace the human-readable numbers with the long-drawn-out numbers, as many times as they may appear, in as many strings as you pass to it.

This is certainly not smaller or faster than your existing solution, but it is usable in other ways.

opp_humanReadable <- function(vec) {
  # multipliers in bytes for each recognized unit ("B" and "b" are both 1)
  known <- c(B = 1, kB = 1000, MB = 1e+06, GB = 1e+09, TB = 1e+12, PB = 1e+15,
             EB = 1e+18, ZB = 1e+21, YB = 1e+24, KiB = 1024, MiB = 1048576,
             GiB = 1073741824, TiB = 1099511627776, PiB = 1125899906842624,
             EiB = 1152921504606846976, ZiB = 1.18059162071741e+21, YiB = 1.20892581961463e+24,
             b = 1, Kb = 1024, Mb = 1048576, Gb = 1073741824,
             Tb = 1099511627776, Pb = 1125899906842624, KB = 1024
             )
  # a number (optionally signed, optionally starting with "."), spaces, a known unit
  ptn <- paste0(
    "(-?\\d+\\.?\\d*|-?\\.\\d+)",
    "\\s*",
    "(", paste0(names(known), collapse = "|"), ")\\b")
  gre <- gregexpr(ptn, vec)
  matches <- regmatches(vec, gre)
  # split every match into its unit, the whitespace before the unit, and the number
  unit <- lapply(matches, gsub, pattern = "^[-.0-9]*\\s*", replacement = "")
  rest <- lapply(matches, gsub, pattern = "^[-.0-9]*(\\s*)\\S*$", replacement = "\\1")
  num <- lapply(matches, gsub, pattern = "[^-.0-9]", replacement = "")
  newnum <- Map(function(a, p) {
    if (length(a)) {
      sapply(as.numeric(a) * known[p], format, scientific = FALSE)
    } else character(0)
  }, num, unit)
  # write the expanded number back, keeping the original spacing and unit
  regmatches(vec, gre) <- Map(paste0, newnum, rest, unit)
  vec
}

vec <- c('100.1   MB 2 KiB', '100.1MB', 'foo  -100.1 MB quux', '9 kB', '10 kB', '9 xx',
         '.2 GiB', 'hello -.2PB world')
opp_humanReadable(vec)
# [1] "100100000   MB 2048 KiB"        "100100000MB"
# [3] "foo  -100100000 MB quux"        "9000 kB"
# [5] "10000 kB"                       "9 xx"
# [7] "214748365 GiB"                  "hello -200000000000000PB world"

It tries to preserve the spacing within and around each number/unit pair. Note that the unit suffix itself is left in place; only the number is expanded to bytes.

If you're curious, I derived known with:

# adapted from utils:::format.object_size
known_units <- list(
  SI = c("B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"),
  IEC = c("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"),
  legacy = c("b", "Kb", "Mb", "Gb", "Tb", "Pb"),
  LEGACY = c("B", "KB", "MB", "GB", "TB", "PB"))
known_bases <- c(SI = 1000, IEC = 1024, legacy = 1024, LEGACY = 1024)
# the first unit in each family is plain bytes, so exponents start at 0
known <- Map(function(un, ba) setNames(ba^(seq_along(un) - 1), un),
             known_units, known_bases)
# drop names already defined by an earlier family (earlier families win)
for (i in seq_along(known)[-1]) {
  nms <- names(known[[i]])
  known[[i]] <- known[[i]][ nms[ ! nms %in% unlist(lapply(known[1:(i-1)], names)) ] ]
}
known <- unlist(unname(known))

Kludgy perhaps, but I know if I didn't do it programmatically, I would miss a comma or something.

An extension to this function might accept some format-like arguments such as big.mark=, small.mark=, etc. Better yet, that could be a companion function that "finds" the numbers (presumably after this function has been called) and inserts commas, etc.
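
A minimal sketch of such a companion function follows. The name `add_big_mark` and its `big.mark` argument are hypothetical, chosen only to mirror `format()`. It assumes that only integer runs of four or more digits should be grouped, and it skips digits that follow a decimal point.

```r
# Hypothetical companion helper: insert a thousands separator into long integers
add_big_mark <- function(vec, big.mark = ",") {
  # group one run of digits (4+ characters) into blocks of three from the right
  group_digits <- function(d) {
    n <- nchar(d)
    first <- (n - 1) %% 3 + 1             # length of the leading block (1 to 3)
    starts <- seq(first + 1, n, by = 3)   # start positions of the remaining blocks
    paste(c(substr(d, 1, first), substring(d, starts, starts + 2)),
          collapse = big.mark)
  }
  # runs of 4+ digits that are not preceded by a digit or a decimal point
  gre <- gregexpr("(?<![.0-9])[0-9]{4,}", vec, perl = TRUE)
  regmatches(vec, gre) <- lapply(
    regmatches(vec, gre),
    function(m) vapply(m, group_digits, character(1), USE.NAMES = FALSE))
  vec
}

add_big_mark(c("foo  -100100000 MB quux", "2048 KiB", "9 xx"))
```

Because it works on the digit characters directly rather than converting them back to numeric, it does not lose precision on very long numbers.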

r2evans