2

I have a dataframe with values as below:

BrandName  Expense
Apple      $1.8B
Google     $3.2B
GE         -
facebook   $281M
McDonald   $719M

I want to clean these expense values such that they are finally on same scale (in billions). For ex the final data frame should look like:

BrandName  Expense
Apple      1.8
Google     3.2
facebook   0.281
McDonald   0.719

$ can be simply removed by gsub. This is fine. But I am facing problem afterwards. I am applying a function A which uses grepl to check if the value contains 'M', if true (strip 'M', convert to numeric value, and divide by 1000) and if it returns false (strip 'B', convert to numeric value)

A <- function(x){
  if (grepl("M", x))
  {
    str_replace(x, "M", "")
    as.numeric(x)
    x <- x/1000
  }
  else if (grepl("B", x))
  {
    str_replace(x, "B", "")
    as.numeric(x)
  }
}
frame <- data.frame(frame[1], apply(frame[2],2, A))

But all the expense values are coming out to be NA in final result. On further analysis, I noticed for all values, its going in elseif part. Am i making a bad use of grepl in apply function ? If yes how can i fix it.

or any other better solution to solve this particular problem?

cexplorer
  • 549
  • 5
  • 13
  • @akrun: The question you suggested is exactly opposite to what I asked. – cexplorer Sep 25 '16 at 10:55
  • 1
    In your function, you should assign the returned values of `str_replace` and `as.numeric` to `x`. In addition, `apply(frame[2], 2, A)` applies `grepl` to the entire column (i.e. a vector of strings), which is why you might have gotten the `the condition has length > 1 and only the first element will be used` warning. You should `apply` by rows, i.e. `apply(frame[2], MAR = 1, A)` or use `sapply`. This is why `grepl` isn't working as you expect. – Weihuang Wong Sep 25 '16 at 11:11
  • @WeihuangWong: Thank you for pointing out my mistakes, you got a point! I will check if it works once I fix the problems mentioned by you. Thanks once again. – cexplorer Sep 26 '16 at 12:03

2 Answers2

1

Here is a base R solution which might be more sensible for your problem, depending on your needs:

df$ExpenseScaled <- as.numeric(gsub("[$MB]", "", df$Expense))
m.index          <- substr(df$Expense, nchar(df$Expense), nchar(df$Expense)) == 'M'
df$ExpenseScaled[m.index] <- df$ExpenseScaled[m.index] / 1000

 df
 BrandName Expense ExpenseScaled
1     Apple   $1.8B         1.800
2    Google   $3.2B         3.200
3  Facebook   $281M         0.281
4 McDonalds   $719M         0.719

The first line of code removes the dollar sign and amount symbol (B or M) to obtain a numerical amount. The next two lines of code conditionally divide millions figures by 1000 per your specification.

Tim Biegeleisen
  • 502,043
  • 27
  • 286
  • 360
  • @cexplorer Sorry, I had a typo in there. I meant to use `m.index` to subset the `ExpenseScaled` column. Try it again, it just worked for me. – Tim Biegeleisen Sep 25 '16 at 11:17
  • Perfect! I think this is the most elegant solution to my problem. Just out of curiosity what was wrong in the solution I was trying. I know there would be multiple problems ;) – cexplorer Sep 25 '16 at 11:21
  • @cexplorer I see multiple problems indeed, such as you calling `as.numeric(x)`. This won't actually _set_ the `x` variable to be numeric. To do that, you would need an _assignment_, e.g. `x <- as.numeric(x)`. Also, there appear to be problems in the way you were trying to rollup data in your custom function. I don't think you need a custom function to handle this problem. – Tim Biegeleisen Sep 25 '16 at 11:25
  • yeah that silly mistake I noticed. But why grepl in first case always returned false ? Why was it always going in the other grep ? Any idea – cexplorer Sep 25 '16 at 11:32
  • @cexplorer I'm not convinced that it always went to the `else` condition. I think some other problems were happening. – Tim Biegeleisen Sep 25 '16 at 11:39
  • OP: See my comment to your question above. The problem is that you are `apply`ing column-wise. – Weihuang Wong Sep 25 '16 at 11:55
0

We can do this with gsubfn. We remove the $ with sub, then replace the 'B', and 'M' with 1 and * 1/1000 using gsubfn, loop through the vector and evaluate the string.

library(gsubfn)
df1$Expense <-  unname(sapply(gsubfn("([A-Z])$", list(B=1, M=' * 1/1000'), 
          sub("[$]", "", df1$Expense)), function(x) eval(parse(text=x))))
df1
#   BrandName Expense
#1     Apple   1.810
#2    Google   3.210
#3  facebook   0.281
#4  McDonald   0.719

Or a base R option would be to extract the numeric substring ('val'), the substring at the end ('nm1'), convert the 'val' to numeric and multiply with the 1, 1/1000 based on matching the substring 'nm1' with the key/value' string created.

val <- gsub("[^0-9.]+", "", df1$Expense)
nm1 <- sub(".*(.)$", "\\1", df1$Expense)
df1$Expense <-  as.numeric(val)*setNames(c(1, 1/1000), c("B", "M"))[nm1]
df1
#  BrandName Expense
#1     Apple   1.800
#2    Google   3.200
#3  facebook   0.281
#4  McDonald   0.719

NOTE: This should be also be extended in case there are Trillions, Thousands etc. in both the methods, i.e. in the first method change inside the list(...) and in second we change by creating more key/value groups in setNames(c(1, ...), c("B", "M", ...))


Another option is parse_number from readr with dplyr

library(dplyr)
library(readr)
df1 %>% 
   mutate(Expense = parse_number(Expense)/c(1, 1000)[grepl("M", Expense)+1])
#   BrandName Expense
#1     Apple   1.800
#2    Google   3.200
#3  facebook   0.281
#4  McDonald   0.719

data

df1 <- structure(list(BrandName = c("Apple", "Google", "facebook", "McDonald"
), Expense = c("$1.8B", "$3.2B", "$281M", "$719M")), .Names = c("BrandName", 
"Expense"), class = "data.frame", row.names = c(NA, -4L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Its working on sample, but giving error on my data. Error in parse(text = x) : :2:0: unexpected end of input 1: - ^ I forgot to mention there are missing values in form of "-" in expense column. Is that the reason why its not working on my dataset.? – cexplorer Sep 25 '16 at 11:06
  • @cexplorer I also updated with another option. Please check if that also give the same problem. – akrun Sep 25 '16 at 11:07