2

I have a factor column which has numbers in it. Some numbers are written in k format, for ex. 99k, 9.25k, 91.9k, etc. while others are written in whole like 998, 575, etc. Two things I want to do-:

  • I want to multiply the numbers having k by 1000 and remove the k.
  • I want to convert this column into a numeric column without losing any data.

If I first convert it into numeric applying as.numeric, the k format numbers will become NAs.

I can't think of anything to resolve this. Can anyone help me with a solution in r?

Following is the Sample data

df=data.frame(
ID = c(1:8),
Likes = c('99k', '997','15.5k', '9.25k','575', '800', '8.5k', '2,400')
)

EDIT I still don't know what is the problem. After trying all the solutions present here, still NAs getting produced. I am providing the link for full data. Kindly go through it and help me to find the problem. The column I am concerned with is 'Likes'.

https://drive.google.com/file/d/1tR0F4SwGmsVjh5NGGpONP6LTaqgDOM4l/view?usp=sharing

Manu Vats
  • 137
  • 1
  • 8
  • Does this answer your question? [Converting unit abbreviations to numbers](https://stackoverflow.com/questions/56159114/converting-unit-abbreviations-to-numbers) – camille Dec 27 '19 at 16:36
  • You have commas in the numbers that need to be removed. See my edit in the line where I define `x_str`. I added a `gsub()` to remove the commas. As a debugging tip, filter the original dataset to where the `NA` values occur and you can spot things like this fairly quickly. –  Jan 01 '20 at 15:42
  • Will remember it in future!! Thanks a lot for the help!! – Manu Vats Jan 02 '20 at 11:37

5 Answers5

5

Here is an alternative, converting the numbers with ks into scientific notation:

df$Likes_num = as.numeric(gsub("k", "e3", df$Likes))
df
  ID Likes Likes_num
1  1   99k     99000
2  2   997       997
3  3 15.5k     15500
4  4 9.25k      9250
5  5   575       575
6  6   800       800
7  7  8.5k      8500

Edited to cover entire dataset:

So you have upper case Ks, as in 16.1K, upper case Ms, as in 1.2M, and also commas, as in 4,225. The following code takes care of all of these:

df$Likes_num = df$Likes
df$Likes_num = gsub(",", "", df$Likes_num)
df$Likes_num = gsub("K", "e3", df$Likes_num)
df$Likes_num = gsub("M", "e6", df$Likes_num)
df$Likes_num = as.numeric(df$Likes_num)

sum(is.na(df$Likes_num))
## [1] 0
BigFinger
  • 1,033
  • 6
  • 7
2

First detect which records with a "k".

df$is_k <- grepl("k", df$Likes)

Strip the "k", and then convert to numeric. If the record had a "k" then multiple my 1000, else multiple by 1.

df$Likes_num <- as.numeric(gsub("k", "", df$Likes)) * ifelse(df$is_k, 1000, 1)

Edit

For multiple units, I adapted something I had elsewhere for a more complex problem. This shows the steps and is simple enough, though I am not sure how robust it is.

Function

convert_units <- function(x) {
  
  if (class(x) == "numeric") return(x)
  
  # named vector of scalings (you can add to this)
  unit_scale <- c("k" = 1e3, "m" = 1e6)
  
  # clean up some potential nuisances with the input
  x_str <- gsub(",", "", trimws(tolower(as.character(x))))
  
  # extract out the letters
  unit_char <- gsub("[^a-z]", "", x_str)
  
  # extract out the numbers and convert to numeric
  x_num <- as.numeric(gsub("[a-z]", "", x_str), "", x_str)
  
  # develop a vector of multipliers
  multiplier <- unit_scale[match(unit_char, names(unit_scale))]
  multiplier[is.na(multiplier)] <- 1
  
  # multiply
  x_num * multiplier
}

Application

df$Likes2 <- convert_units(df$Likes)

Sample Result

  ID Likes Likes2
1  1   99k  99000
2  2   997    997
3  3 15.5k  15500
4  4 9.25k   9250
5  5   575    575
6  6   800    800
7  7  8.5k   8500
8  8 2,400   2400
Community
  • 1
  • 1
  • Still NAs are being introduced. I guess I have both capital and small 'k' in the dataset. Can you modify the above code with respect to that? – Manu Vats Dec 30 '19 at 13:27
  • Good idea!! One thing I just found out that my dataset has 'M's too. Can you include 'M' in the above code? Thanks – Manu Vats Dec 30 '19 at 13:54
  • Does the link @camille provided in the comment help? –  Dec 30 '19 at 14:11
0

This way assumes that the values with "k" will become NA when changed to numeric, then change those values.

df$LikesNum <- suppressWarnings(as.numeric(as.character(df$Likes)))
df$LikesNum[is.na(df$LikesNum)] <- 1000*as.numeric(as.character(sub("k$", "", df$Likes[is.na(df$LikesNum)])))
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
0

Here's with dplyr and stringr

df %>%
 mutate(Likes_numeric = str_remove(Likes, "k|K"),
        Likes_numeric = as.numeric(Likes_numeric),
        Likes_numeric = ifelse(str_detect(Likes, "k|K"), Likes_numeric * 1000, Likes_numeric))

  ID Likes Likes_numeric
1  1   99k         99000
2  2   997           997
3  3 15.5k         15500
4  4 9.25k          9250
5  5   575           575
6  6   800           800
7  7  8.5k          8500
Ben G
  • 4,148
  • 2
  • 22
  • 42
0

I looked into your data, finding that you are using K rather than k. For general purpose, I guess you can try the following code, which is case insensitive for replacement

df <- within(df,Likes <- as.numeric(gsub("k", "e3", Likes,ignore.case = TRUE)))
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81