Are there any packages in R that can "understand" numbers written in English? For example:

"50 million" -> 50,000,000
"$17.9M" -> 17,900,000

It doesn't have to handle all possible cases, but I want to see how people tackle this problem so that I can learn from their code and write my own solution.

Bamqf
  • You could reverse engineer some of the ideas from [this question](http://stackoverflow.com/questions/28159936/formatting-large-currency-or-dollar-values-to-millions-billions) – Rich Scriven Jul 24 '15 at 04:56
  • google (and wolfram alpha) are getting pretty good at those things, I'd try to look for an API – baptiste Jul 24 '15 at 05:14
  • A similar question has also been discussed [here](http://stackoverflow.com/questions/11340444/is-there-an-r-function-to-format-number-using-unit-prefix) – RHertel Jul 24 '15 at 05:21

1 Answer

This is how I would approach it.

library(stringr)
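m <- your_vector # character vector, e.g. c("50 million", "$17.9M")
m <- tolower(m) # normalize case
m <- gsub(",", "", m, fixed = TRUE) # drop thousands separators
m <- gsub("$", "", m, fixed = TRUE) # drop the currency symbol; fixed = TRUE because "$" is a regex anchor
m <- gsub("\\s", "", m) # drop whitespace

dat <- data.frame(raw = m, stringsAsFactors = FALSE)
dat$words <- str_extract(m, "[a-z]+") # extract the unit word, e.g. "million" or "m"
dat$numbers <- str_extract(m, "[0-9.]+") # extract the numeric part, keeping decimals such as "17.9"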

Then create a lookup data.frame from unique(dat$words), assign each word its multiplier, merge it back onto the data, and multiply.

dat_merge <- data.frame(
   words = unique(dat$words), 
   multiplier = c(1e6,1e6) # from LOOKING at unique(dat$words)
) # new df

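dat <- merge(dat, dat_merge, by = "words") # note: merge() sorts the rows by the key
dat$value <- dat$multiplier * as.numeric(dat$numbers) # str_extract() returns character, so convert first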

dat$value

I particularly like this approach because you can easily update it over time, especially when new formats turn up. I use it personally in a lot of projects for verbatim company names and some other small text elements.
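
As a rough sketch of how the same idea could be packaged for reuse, the lookup can live in a named vector instead of a merged data.frame. The function name parse_amount and the suffix table are assumptions made for illustration, not part of the code above; extend the table as new formats appear.

```r
# Hypothetical helper: parse strings such as "50 million" or "$17.9M".
# The suffix table is an assumption; add entries as new formats turn up.
parse_amount <- function(x) {
  multipliers <- c(
    k = 1e3, thousand = 1e3,
    m = 1e6, million = 1e6,
    b = 1e9, billion = 1e9
  )
  s <- tolower(x)
  s <- gsub("[$,[:space:]]", "", s)         # drop "$", commas and whitespace
  num <- as.numeric(sub("[a-z]+$", "", s))  # numeric part
  word <- sub("^[0-9.]+", "", s)            # trailing unit word, "" if none
  mult <- unname(multipliers[word])         # NA for units not in the table
  mult[which(word == "")] <- 1              # plain numbers need no scaling
  num * mult
}

parse_amount(c("50 million", "$17.9M", "1,200", "3 zillion"))
```

With this table the first three inputs should come out as 50,000,000, 17,900,000 and 1,200, while "3 zillion" yields NA because its unit is not listed. Returning NA for unknown units makes new formats easy to spot.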

Brandon Bertelsen