5

I have a dataset that abbreviates numerical values in a column. For example, 12M means 12 million and 1.2k means 1,200. M and k are the only abbreviations. How can I write code that allows R to sort these values from lowest to highest?

I've thought about using gsub to convert M to 000,000 etc., but that does not take decimals into account (1.5M would then become 1.5000000).

smci
  • 2
    You could call [numfmt](https://www.gnu.org/software/coreutils/manual/html_node/numfmt-invocation.html) from `system()` if it is installed on your system and in the PATH. Something like `system(paste("numfmt --from=auto --to=none", "12M"), intern = TRUE)`. – neilfws May 16 '19 at 00:55
  • Related previous discussion - https://stackoverflow.com/questions/36806215/convert-from-k-to-thousand-1000-in-r including an answer in the comments which addresses this. – thelatemail May 16 '19 at 01:49
  • Also, seems we can safely assume normalized mantissas, in your case (`12.00k` or `0.012k` are not normalized, for example) – smci May 16 '19 at 01:59
  • Oh and do you care about handling `NA`, `NaN`, `Inf` without blowing up? – smci May 16 '19 at 02:01

4 Answers

6
  • So you want to translate SI unit abbreviations ('K','M',...) into exponents, and thus numerical powers-of-ten. Given that all units are single-letter, and the exponents are uniformly-spaced powers of 10**3, here's working code that handles 'Kilo'...'Yotta', and any future exponents:
    > 10 ** (3*as.integer(regexpr('T', 'KMGTPEZY')))
    [1] 1e+12

Then just multiply that power-of-ten by the decimal value you have.
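As a minimal sketch of that multiplication step, assuming scalar input of the form number-plus-optional-letter: the helper name `abbrev_to_number` is hypothetical, and upper-casing the suffix (so that 'k' and 'm' are read as Kilo and Mega) is an assumption based on the question, where M and k are the only abbreviations.

```r
# Hypothetical helper (scalar input): split "1.5M" into number and suffix, then scale
abbrev_to_number <- function(s) {
  mantissa <- as.numeric(sub("[A-Za-z]$", "", s))   # "1.5M" -> 1.5
  suffix   <- toupper(sub("^[0-9.]+", "", s))       # "1.5M" -> "M", "100" -> ""
  if (!nzchar(suffix)) return(mantissa)             # plain number, nothing to scale
  # Position of the letter in the prefix string gives the power of 10^3
  idx <- as.integer(regexpr(suffix, "KMGTPEZY", fixed = TRUE))
  if (idx < 1) return(NA_real_)                     # unknown suffix
  mantissa * 10^(3 * idx)
}

abbrev_to_number("12M")
abbrev_to_number("1.2k")
abbrev_to_number("100")
```

For a whole column this would need `sapply()` or a vectorized lookup, since the function handles one string at a time.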

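  • Also, you probably want to detect and handle the 'no-match' case for unknown letter prefixes, otherwise you'd get a nonsensical exponent of -1*3 (regexpr returns -1 when there is no match). Test the match position rather than the resulting power, since 10**(-3) is still positive:
    > unit_to_power <- function(u) {
        idx <- as.integer(regexpr(u, 'KMGTPEZY', fixed = TRUE))  # -1 when u is not found
        return (if (idx > 0) 10**(3*idx) else 1)
    }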
  • Now if you want to match both 'k' and 'K' to Kilo case-insensitively (as computer people often write, even though it's technically an abuse of SI), then you'll need to special-case it, e.g. with an if-else ladder/expression. SI units are case-sensitive in general: 'M' means 'Mega' but 'm' strictly means 'milli', even if disk-drive users say otherwise; upper-case is conventionally for positive exponents. So for just a few prefixes, @DanielV's case-specific code is better.

  • If you want negative SI prefixes too, use as.integer(regexpr(u, 'zafpnum@KMGTPEZY')-8), where @ is just a throwaway character to keep the spacing uniform; it shouldn't actually get matched. Again, if you need to handle units that are not powers of 10**3, like 'deci' and 'centi', that will require special-casing, or the general dict-based approach WeNYoBen uses.

  • base::regexpr is vectorized over the text it searches but not over the pattern (here, the unit letter u), and its performance is bad on big inputs, so if you want to vectorize and get higher performance, use stringr::str_locate.
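The lookup can also be vectorized without any extra package. The sketch below swaps in base `match()` against the individual prefix letters instead of `stringr::str_locate`, purely to stay dependency-free; the names `prefixes` and `prefix_multiplier` are hypothetical, and treating unknown letters as a multiplier of 1 is an assumption carried over from the no-match handling above.

```r
# Prefix letters in order of increasing exponent; "@" is the placeholder for 10^0
prefixes <- strsplit("zafpnum@KMGTPEZY", "")[[1]]

# Hypothetical helper: vectorized suffix -> multiplier lookup
prefix_multiplier <- function(u) {
  idx  <- match(u, prefixes)       # NA for unknown or empty suffixes
  mult <- 10^(3 * (idx - 8))       # position 8 ("@") corresponds to 10^0
  mult[is.na(idx)] <- 1            # no-match case: leave the value unscaled
  mult
}

prefix_multiplier(c("K", "M", "n", "", "?"))
```

Because `match()` compares whole strings, a multi-character or empty suffix simply falls into the no-match case rather than matching part of the prefix string.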

smci
3

Give this a shot:

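# Scalar only: if() tests a single value, so apply it over a column with sapply(x, Text_Num)
Text_Num <- function(x){
    if (grepl("M", x, ignore.case = TRUE)) {
        # Drop the suffix, then scale to millions
        as.numeric(gsub("M", "", x, ignore.case = TRUE)) * 1e6
    } else if (grepl("k", x, ignore.case = TRUE)) {
        # Drop the suffix, then scale to thousands
        as.numeric(gsub("k", "", x, ignore.case = TRUE)) * 1e3
    } else {
        as.numeric(x)
    }
}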
Daniel V
  • This is exactly the kind of code I was looking for, I didn't think to multiply after changing to numeric values. Only problem is it returns "NA" when it gets to the "k" values. I'm trying to figure out a work around – Michael O'Keefe May 31 '19 at 20:10
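One plausible cause of the NA values mentioned in the comment is calling Text_Num on a whole column: `if()` looks at a single value, so on a vector only one branch is taken (or, in recent R versions, an error is raised). The following is a sketch of a vectorized variant under that assumption; the name `Text_Num_vec` is hypothetical and not part of the original answer.

```r
# Hypothetical vectorized variant of Text_Num; accepts a whole column at once
Text_Num_vec <- function(x) {
  x <- trimws(as.character(x))
  # Per-element multiplier: 1e6 for a trailing M, 1e3 for a trailing k, else 1
  mult <- ifelse(grepl("M$", x, ignore.case = TRUE), 1e6,
          ifelse(grepl("k$", x, ignore.case = TRUE), 1e3, 1))
  # Strip the suffix, convert to numeric, then scale
  as.numeric(sub("[kM]$", "", x, ignore.case = TRUE)) * mult
}

vals <- c("12M", "1.2k", "300", "1.5M")
Text_Num_vec(vals)
```

Keeping the original function and calling `sapply(vals, Text_Num)` should give the same numbers, just one element at a time.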
1

In your case you can use gsubfn:

a=c('12M','1.2k')
dict<-list("k" = "e3", "M" = "e6")
as.numeric(gsubfn::gsubfn(paste(names(dict),collapse="|"),dict,a))
[1] 1.2e+07 1.2e+03
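To get from the converted values to the sorting the question asks for, something like the following may work. It is a sketch that swaps in nested base `sub()` calls for the gsubfn dictionary (so it runs without the package), and the sample vector is made up for illustration.

```r
# Made-up sample data in the question's format
a <- c("12M", "1.2k", "900", "3.4M", "15k")

# Base-R stand-in for the gsubfn dictionary: rewrite each suffix as an exponent
as_exp <- sub("M$", "e6", sub("k$", "e3", a))   # "12M" -> "12e6", "1.2k" -> "1.2e3"
values <- as.numeric(as_exp)

# order() gives the permutation that sorts the numbers; apply it to the labels
sorted <- a[order(values)]
sorted
```

In a data frame the same idea would be `df[order(values), ]`, keeping the abbreviated column intact and using the numeric vector only as the sort key.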
BENY
0

I am glad to meet you.

I wrote another answer

Define function

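res = function (x) {
  # Plain numbers (including "1e3") convert directly; suffixed values give NA here
  result = suppressWarnings(as.numeric(x))
  if(is.na(result)){
    # Turn "4k" into "4*1e3" and "5M" into "5*1e6", then evaluate the expression
    text = gsub("k", "*1e3", x, ignore.case = TRUE)
    text = gsub("m", "*1e6", text, ignore.case = TRUE)
    # eval(parse()) runs arbitrary code, so only use this on trusted input
    result = eval(parse(text = text))
  }
  return(result)
}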

Result

> res("5M")
[1] 5e+06
> res("4K")
[1] 4000
> res("100")
[1] 100
> res("4k")
[1] 4000
> res("1e3")
[1] 1000
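Since `eval(parse())` will execute whatever the string contains, it may be worth checking the input shape first. The variant below is only a sketch: the name `res_checked` and the accepted pattern (digits, optional decimal part, optional exponent, optional k or M) are assumptions, not part of the original answer.

```r
# Hypothetical guarded variant: only evaluate strings shaped like <number><k|M>
res_checked <- function(x) {
  ok <- grepl("^[0-9]+(\\.[0-9]+)?([eE][+-]?[0-9]+)?[kKmM]?$", x)
  if (!ok) return(NA_real_)        # refuse anything else, e.g. "system('ls')"
  text <- gsub("k", "*1e3", x, ignore.case = TRUE)
  text <- gsub("m", "*1e6", text, ignore.case = TRUE)
  eval(parse(text = text))         # now limited to simple arithmetic like "5*1e6"
}

res_checked("5M")
res_checked("1e3")
res_checked("system('ls')")
```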
Steve Lee