-1

I have some data, for example c("1k", "2k", "1.5k", ...), and would like to transform the values to c("1000", "2000", "1500", ...). gsub is quite fast at replacing across a large vector, but it cannot match the 1 or 1.5 and then multiply it by 1000.

I could match (\d+(.\d{1})?[Kk]), (\d+(.\d{2})?[Kk]), (\d+(.\d{3})?[Kk]) and replace them, but that looks like a brute-force approach, so I would like to know whether there is another way to quickly extract the number and then do the calculation.

I tried extracting the numbers, multiplying them, and then looping through the list and doing a gsub on each element individually, but that is very slow.

Thanks a lot.

Note that the strings can be ' 1k', 'display price: 1k', '1k - 2k', or contain other random characters. We always want the first price that appears, so for the '1k - 2k' case we want 1k. There are also millions of rows, so performance gets worse if the substitution has to be done several times.

Edward
  • Possible duplicate? https://stackoverflow.com/questions/56159114/converting-unit-abbreviations-to-numbers – thelatemail Aug 26 '19 at 02:28
  • How would one convert "1k - 2k"? What is the expected output in that case? – jdobres Aug 26 '19 at 02:43
  • @jdobres so if there is `1k - 2k`, we want to just get the first one which is `1k` – Edward Aug 26 '19 at 02:47
  • Hi @thelatemail, thanks for pointing out, it is similar but with slight differences, here the price will come with some random characters before and after the price number. – Edward Aug 26 '19 at 02:58

3 Answers

1

To remove the random characters this first removes all characters except digits, dot, k and K and then replaces k or K and everything thereafter with e3. Finally it converts what is left to numeric.

x <- c("1k", "2k", "1.5k", "   6K", "1k - 2k")
as.numeric(sub("k.*", "e3", gsub("[^0-9.kK]", "", x), ignore.case = TRUE))
## [1] 1000 2000 1500 6000 1000
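
The one-liner above can be unpacked into its three steps to show what each call contributes. This is only an illustrative sketch: the intermediate names `kept`, `sci`, and `result` are made up here and are not part of the original answer.

```r
x <- c("1k", "2k", "1.5k", "   6K", "1k - 2k")

# Step 1: drop every character except digits, dots, and k/K
kept <- gsub("[^0-9.kK]", "", x)
# "1k" "2k" "1.5k" "6K" "1k2k"

# Step 2: replace the first k/K and everything after it with "e3"
sci <- sub("k.*", "e3", kept, ignore.case = TRUE)
# "1e3" "2e3" "1.5e3" "6e3" "1e3"

# Step 3: let as.numeric() parse the scientific notation
result <- as.numeric(sci)
result
# 1000 2000 1500 6000 1000
```

Because step 1 keeps any k, digit, or dot, a label that itself contains one of those before the price (for example "stock: 1k", which becomes "k1k") would come out as NA, so this relies on the surrounding text being free of those characters.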
G. Grothendieck
0

We create a named vector of multipliers, extract the numeric part of each string, and multiply it by the multiplier looked up from the remaining non-numeric suffix.

unname(as.numeric(gsub("[A-Za-z]+", "", v1)) *
     setNames(c(1e3, 1e6), c('k', 'm'))[sub("[0-9.]+", "", v1)])
#[1]    1000    2000    1500 1700000

data

v1 <- c("1k", "2k", "1.5k", '1.7m')
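
The same computation can be written out step by step to make the lookup explicit. This is a sketch only: the names `multipliers`, `number`, `suffix`, and `result` are introduced here for illustration, and it assumes, as the answer does, that each string is just a number followed by a single unit letter.

```r
v1 <- c("1k", "2k", "1.5k", "1.7m")

# Named lookup table: unit suffix -> multiplier
multipliers <- setNames(c(1e3, 1e6), c("k", "m"))

# Numeric part: remove the letters, then convert
number <- as.numeric(gsub("[A-Za-z]+", "", v1))

# Suffix: remove the digits and dots, leaving "k" or "m"
suffix <- sub("[0-9.]+", "", v1)

# Index the lookup table by suffix name and multiply element-wise
result <- unname(number * multipliers[suffix])
result
# 1000 2000 1500 1700000
```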
akrun
  • Hi akrun, thanks for the advice. Sorry, I forgot to mention that the original prices will have some random characters as well. I think this will work well if there are no other characters around the price, but unfortunately, because of the quality of the data, it contains some other random characters too. – Edward Aug 26 '19 at 02:35
  • Hi akrun, sorry I put some more examples to it, thanks – Edward Aug 26 '19 at 02:43
0
x = c("1k", "2k", "1.5k", "1k - 2k", "1m", "display price: 1k")
as.numeric(sub(".*(\\d+)k.*", "\\1", x)) * 1000
#[1] 1000 2000 5000 2000   NA 1000
#Warning message:
#NAs introduced by coercion 
d.b
  • Hi @d.b, thanks for the advice. I tried it and it worked in most cases, but it does not work in cases like `'display price: 1k'`. Is there a way to handle the random characters before and after the price? Thanks – Edward Aug 26 '19 at 02:56
  • Hi @d.b, thanks for this. I copied those two lines but it gives me the wrong answer `1000 2000 5000 2000 NA 1000` – Edward Aug 26 '19 at 05:22
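
The wrong values reported in the comment above come from the leading `.*` in the pattern `.*(\\d+)k.*`. It is greedy, so it swallows as much as it can and leaves `(\\d+)` only the last digit before the last k: "1.5k" yields 5 and "1k - 2k" yields 2. One possible alternative, sketched here rather than taken from any of the answers, is to locate the first number followed by k or K with `regexpr()` and pull it out with `regmatches()`. The names `m`, `hit`, and `out` are illustrative, and the extra input " 6K" is added only to show the case-insensitive match.

```r
x <- c("1k", "2k", "1.5k", "1k - 2k", "1m", "display price: 1k", " 6K")

# Find the FIRST number (with optional decimal part) directly followed by k or K
m <- regexpr("[0-9]+(\\.[0-9]+)?[kK]", x)

# Start from NA so strings without a k-price stay NA
out <- rep(NA_real_, length(x))
hit <- m != -1

# regmatches() returns only the matched substrings, in order;
# strip the trailing k/K, convert, and scale by 1000
out[hit] <- as.numeric(sub("[kK]$", "", regmatches(x, m))) * 1000
out
# 1000 2000 1500 1000 NA 1000 6000
```

Each call here is vectorised over the whole input, so there is no per-element loop; how it performs on millions of rows would still need to be measured on the real data.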