I think this can be generalized a little, leaving room for the calling function to deal with other things as needed. Replacing the substring in place gives you some interesting power.
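As a toy illustration of what in-place replacement means here (the strings and the doubling step are made up for the example, not part of the suggestion itself), base R's `regmatches<-` lets you rewrite only the matched substrings and leave everything around them alone:

```r
# Toy input: double every run of digits, leave the rest of each string alone
x <- c("a1b22", "none", "333")
m <- gregexpr("[0-9]+", x)

# One replacement vector per input string; strings with no match get character(0)
doubled <- lapply(regmatches(x, m), function(s) as.character(as.numeric(s) * 2))
regmatches(x, m) <- doubled
x
# [1] "a2b44" "none"  "666"
```

Anything the pattern does not match passes through untouched, which is what leaves the caller free to do other things with the same strings.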
Here's a suggestion that replaces each human-readable number with its fully written-out value, however many times one appears, in however many strings you pass to it.
This is certainly not smaller or faster than your existing solution, but it is usable in other ways.
opp_humanReadable <- function(vec) {
  # Bytes per unit; where a unit name is ambiguous, the SI meaning wins (MB = 1e6)
  known <- c(B = 1, kB = 1000, MB = 1e+06, GB = 1e+09, TB = 1e+12, PB = 1e+15,
             EB = 1e+18, ZB = 1e+21, YB = 1e+24, KiB = 1024, MiB = 1048576,
             GiB = 1073741824, TiB = 1099511627776, PiB = 1125899906842624,
             EiB = 1152921504606846976, ZiB = 1.18059162071741e+21,
             YiB = 1.20892581961463e+24,
             b = 1, Kb = 1024, Mb = 1048576, Gb = 1073741824,
             Tb = 1099511627776, Pb = 1125899906842624, KB = 1024
  )
  # A number (optional sign, optional decimals, or a bare ".5"),
  # optional whitespace, then one of the known units
  ptn <- paste0(
    "(-?\\d+\\.?\\d*|-?\\.\\d+)",
    "\\s*",
    "(", paste0(names(known), collapse = "|"), ")\\b")
  gre <- gregexpr(ptn, vec)
  matches <- regmatches(vec, gre)
  # Split each match into its unit, the whitespace between, and the number
  unit <- lapply(matches, gsub, pattern = "^[-.0-9]*\\s*", replacement = "")
  rest <- lapply(matches, gsub, pattern = "^[-.0-9]*(\\s*)\\S*$", replacement = "\\1")
  num <- lapply(matches, gsub, pattern = "[^-.0-9]", replacement = "")
  # Scale each number by its unit and write it out without scientific notation
  newnum <- Map(function(a, p) {
    if (length(a)) {
      sapply(as.numeric(a) * known[p], format, scientific = FALSE)
    } else character(0)
  }, num, unit)
  # Put the expanded number back in place, keeping the original spacing and unit
  regmatches(vec, gre) <- Map(paste0, newnum, rest, unit)
  vec
}
vec <- c('100.1 MB 2 KiB', '100.1MB', 'foo -100.1 MB quux', '9 kB', '10 kB', '9 xx',
         '.2 GiB', 'hello -.2PB world')
opp_humanReadable(vec)
# [1] "100100000 MB 2048 KiB"          "100100000MB"
# [3] "foo -100100000 MB quux"         "9000 kB"
# [5] "10000 kB"                       "9 xx"
# [7] "214748365 GiB"                  "hello -200000000000000PB world"
It tries to preserve the spacing within and around each number/unit pair.
If you're curious, I derived known with:
# adapted from utils:::format.object_size
known_units <- list(
  SI = c("B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"),
  IEC = c("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"),
  legacy = c("b", "Kb", "Mb", "Gb", "Tb", "Pb"),
  LEGACY = c("B", "KB", "MB", "GB", "TB", "PB"))
known_bases <- c(SI = 1000, IEC = 1024, legacy = 1024, LEGACY = 1024)
# Powers start at 0, so the first unit in each family (B or b) is 1 byte
known <- Map(function(un, ba) setNames(ba^(seq_along(un) - 1), un),
             known_units, known_bases)
# Drop any unit name already defined by an earlier family
for (i in seq_along(known)[-1]) {
  nms <- names(known[[i]])
  known[[i]] <- known[[i]][ nms[ ! nms %in% unlist(lapply(known[1:(i-1)], names)) ] ]
}
known <- unlist(unname(known))
Kludgy perhaps, but I know if I didn't do it programmatically, I would miss a comma or something.
An extension to this function might accept some format-like arguments such as big.mark=, small.mark=, etc. Better yet, a companion function could find the numbers (presumably after this function has been called) and insert commas, etc.
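A minimal sketch of what such a companion might look like. The name `add_big_mark` is hypothetical, and it rests on a simplifying assumption: only runs of four or more digits that are not preceded by a digit or a decimal point get touched, so fractional parts are left alone.

```r
# Hypothetical companion function: insert a thousands separator into long
# integer runs found anywhere in each string.
add_big_mark <- function(vec, big.mark = ",") {
  # Runs of 4+ digits that are not preceded by a digit or a decimal point
  gre <- gregexpr("(?<![.0-9])[0-9]{4,}", vec, perl = TRUE)
  regmatches(vec, gre) <- lapply(regmatches(vec, gre), function(m) {
    # Add the mark after any digit followed by a whole number of 3-digit groups
    gsub("([0-9])(?=([0-9]{3})+$)", paste0("\\1", big.mark), m, perl = TRUE)
  })
  vec
}

add_big_mark(c("100100000 MB 2048 KiB", "9 xx", "pi is 3.14159"))
# [1] "100,100,000 MB 2,048 KiB" "9 xx"
# [3] "pi is 3.14159"
```

It works on the digit strings directly rather than converting back to numeric, so very large values do not lose precision along the way.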