grepl in R not working properly while subsetting

Question

I have a big dataset with some columns representing amount with Decimal(5,2) format:

DF

Name|Salary|State
Joe|12345.34|AZ
Mac|3423.67|CT
Lilly|12342.345|CA

Clearly only Joe qualifies the criteria so after subsetting I should get records NOT matching the criteria on Salary Column.Thus the result should be

 Name|Salary|State
 Mac|3423.67|CT
 Lilly|12342.345|CA

I want to use subset function:

subset(grepl("^[[:digit:]]{,5}\\.[[:digit:]]{,2}$",DF$Salary)

OR

subset(grepl("[[:digit:]]{,5}\\.[[:digit:]]{,2}",DF)
subset(grepl("[[:digit:]]{,5}[.][[:digit:]]{,2}",DF)

None of these give me correct result. On further investigation I found that the grepl itself doesnt work properly.

Example:

x <- "12345.45"
grepl("[[:digit:]]{,5}\\.[[:digit:]]{,2}",x)  # TRUE
grepl("[[:digit:]]{,4}\\.[[:digit:]]{,2}",x)  # TRUE
grepl("[[:digit:]]{,4}[.][[:digit:]]{,2}",x) # TRUE

checked these links: [1]: http://%20https://stackoverflow.com/questions/27721008/how-do-i-deal-with-special-characters-like-in-my-regex [2]: https://stackoverflow.com/questions/25806623/r-regular-expression-for-string-containing-full-stops [3]: https://stackoverflow.com/questions/6638072/escaped-periods-in-r-regular-expressions — Deep, May 25 '20 at 05:47
Why do you want to do this with regex? What is the use-case for this? How does it matter if the number of decimal places is 2 or 3? — Ronak Shah, May 25 '20 at 05:56
Numerically speaking it doesn't matter , but the datatype restricts it to use 2 decimals. It can be upto 2 decimals. Use case is that I am reading a flat file to check its format as per guidelines — Deep, May 25 '20 at 06:06
This question is confusing: Joe's salary has 2 decimals, Lilly's has 3. Why do you say Joe's slaray doesn't fulfill the criteria-- and what are these criteria in the first place? — Chris Ruehlemann, May 25 '20 at 06:33
@ChrisRuehlemann criteria is that the salary should be a decimal with integer part upto 14 digits and decimal part upto 2 digits — Deep, May 25 '20 at 06:40

score 1 · Accepted Answer · answered May 25 '20 at 05:57

1

Floating-point comparisons aren't accurate. Read Why are these numbers not equal? .

However, in this case you can use :

subset(df, !grepl('\\d{5}\\.\\d\\d$', Salary))

answered May 25 '20 at 05:57

Ronak Shah

377,200
20
156
213

Thanks Ronak however it isnt working. Please see the image I attached with the question for data values which are returned although they follow the rule. My Dataset is called readFile: I tried subset(readFile, !grepl('\\d{14}\\.\\d{2}$', readFile[[5]])) OR subset(readFile, !grepl('^\\d{14}\\.\\d{2}$', readFile[[5]])) – Deep May 25 '20 at 06:15
A little modification to your answers helped me:: subset(readFile, !grepl('\\.\\d{0,2}$', readFile[[5]]) | !grepl('^\\d{,14}\\.', readFile[[5]])) , where readFile is my Dataframe and column#5 where I wanted to apply the logic. Accepting your answer, the only issue is that though my file doesnt have decimals more than 3 places , I see the dataframe read from it has more decimals ( sometimes > 5). Maybe something with RStudio display options. readFile <- read.table(file=myFileName,header = TRUE, sep = "|", fill=TRUE, na.strings = c("","NA"), as.is = TRUE, numerals = "no.loss") – Deep May 25 '20 at 06:47
You can control the number of decimal to be displayed by using `options(digits = 10)` or any higher number. – Ronak Shah May 25 '20 at 07:53

grepl in R not working properly while subsetting

1 Answers1