Processing data frame column with mixed entries of % and undefined fractions with 0 and *

Question

I have inherited a spreadsheet, read in as a data frame, with ~ 10 columns and ~ 400 rows.

In the example below, for one of those columns, you can see that it contains a mix of both percentage values and fractions.

Furthermore, such fractions may contain ' * ' and/or ' 0 ', and in the numerator and/or denominator, as shown under the OBSERVED column of the example cases shown below.

I seek your help with R code for homogenizing all entries in such columns to decimal numbers, as shown under the EXPECTED column for the examples below, and then repeat this process over all columns in the data frame.

For my analysis, it is quite OK to consider missing values (*) as zeroes (0).

EXAMPLE CASES:

OBSERVED vs. EXPECTED

"0.0%"              0.0
"9.5%"              0.095
"5 / 10"            0.5
"* / 16"            0.0
"0 / 12"            0.0
NA                  0.0
"0 / *"             0.0
"* / *"             0.0

So far what I've tried are as follows (in this same order):

Step 1. Replace * (missing data) with 0 (zero) - works OK

CFP4_REPLACE_Asterisk_w_Zero <- gsub("\\*","0",play.df$CFP4)

Step 2. Convert % to decimals - works OK only on entries with % symbol, but converts fractions to NA

CFP4_ConvPerc2Dcml <- as.numeric(sub("%", "",CFP4_REPLACE_Asterisk_w_Zero,fixed=TRUE))/100

Step 3. Convert fractions into decimal values - syntax shown below, works OK I think, but in this sequential order of steps, the fractions have been already converted to NA, so it is meaningless to execute here...right?

CFP4_ConvFrct2Dcml <- sapply(CFP4_ConvPerc2Dcml, function(x) eval(parse(text=x)))

If I reverse the relative order of steps 2 and 3, that doesn't help either. I've taken a break from R, and would appreciate any (detailed) help. TIA!

Welcome to SO! Please take a moment to read about how to post R questions: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — YOLO, Jul 16 '20 at 05:03
@YOLO - Hope that is better, and that someone can help me with my question. TIA! — AksR, Jul 16 '20 at 05:21

Ronak Shah · Accepted Answer · 2020-07-16T05:57:34.247

1

You can try using this function :

standardize_numbers <- function(x) {
   #Initialize a new vector
   new_vec <- numeric(length = length(x))
   #Get indices where "%" is present
   inds1 <- grepl('%', x)
   #Replace % value
   new_vec[inds1] <- as.numeric(sub('%', '', x[inds1]))/100
   #Get indices where "*" or "/" is present
   inds2 <- grepl('\\*|/', x)
   #Replace "*" with 0
   value2 <- gsub('\\*', '0', x[inds2])
   #Remove whitespace
   value2 <- gsub('\\s', '', value2)
   #Use eval parse to get value2 as numbers. 
   new_vec[inds2] <- sapply(value2, function(y) eval(parse(text = y)))
   new_vec
}

x <- c("0.0%", "9.5%", "5 / 10", "* / 16", "0 / 12", NA, "0 / *", "* / *")
standardize_numbers(x)
#[1] 0.000 0.095 0.500 0.000 0.000 0.000   NaN   NaN

You can apply this to multiple columns using lapply.

edited Jul 16 '20 at 05:57

answered Jul 16 '20 at 05:29

Ronak Shah

377,200
20
156
213

Thank you so much for your function. When I tried your exact same syntax, the error message i am getting is: `> standardize_numbers(x) Error in sub("%", "", x[inds1]) : object 'inds1' not found` – AksR Jul 16 '20 at 05:40
I would add one missing line after new vector initialization: `# Get indices where % is present inds1 <- grepl('%', x)` Then it matches your output, do you concur? Thanks again! – AksR Jul 16 '20 at 05:48
1

Yes, sorry I missed that line while putting this into function. Updated the answer. – Ronak Shah Jul 16 '20 at 05:58

Processing data frame column with mixed entries of % and undefined fractions with 0 and *

1 Answers1