0

I have a big dataset with some columns representing amount with Decimal(5,2) format:

DF

Name|Salary|State
Joe|12345.34|AZ
Mac|3423.67|CT
Lilly|12342.345|CA

Clearly only Joe qualifies the criteria so after subsetting I should get records NOT matching the criteria on Salary Column.Thus the result should be

 Name|Salary|State
 Mac|3423.67|CT
 Lilly|12342.345|CA

I want to use subset function:

subset(grepl("^[[:digit:]]{,5}\\.[[:digit:]]{,2}$",DF$Salary)

OR

subset(grepl("[[:digit:]]{,5}\\.[[:digit:]]{,2}",DF)
subset(grepl("[[:digit:]]{,5}[.][[:digit:]]{,2}",DF)

None of these give me correct result. On further investigation I found that the grepl itself doesnt work properly.

Example:

x <- "12345.45"
grepl("[[:digit:]]{,5}\\.[[:digit:]]{,2}",x)  # TRUE
grepl("[[:digit:]]{,4}\\.[[:digit:]]{,2}",x)  # TRUE
grepl("[[:digit:]]{,4}[.][[:digit:]]{,2}",x) # TRUE

enter image description here

Deep
  • 528
  • 3
  • 12
  • 27
  • checked these links: [1]: http://%20https://stackoverflow.com/questions/27721008/how-do-i-deal-with-special-characters-like-in-my-regex [2]: https://stackoverflow.com/questions/25806623/r-regular-expression-for-string-containing-full-stops [3]: https://stackoverflow.com/questions/6638072/escaped-periods-in-r-regular-expressions – Deep May 25 '20 at 05:47
  • Why do you want to do this with regex? What is the use-case for this? How does it matter if the number of decimal places is 2 or 3? – Ronak Shah May 25 '20 at 05:56
  • Numerically speaking it doesn't matter , but the datatype restricts it to use 2 decimals. It can be upto 2 decimals. Use case is that I am reading a flat file to check its format as per guidelines – Deep May 25 '20 at 06:06
  • This question is confusing: Joe's salary has 2 decimals, Lilly's has 3. Why do you say Joe's slaray doesn't fulfill the criteria-- and what are these criteria in the first place? – Chris Ruehlemann May 25 '20 at 06:33
  • @ChrisRuehlemann criteria is that the salary should be a decimal with integer part upto 14 digits and decimal part upto 2 digits – Deep May 25 '20 at 06:40

1 Answers1

1

Floating-point comparisons aren't accurate. Read Why are these numbers not equal? .

However, in this case you can use :

subset(df, !grepl('\\d{5}\\.\\d\\d$', Salary))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks Ronak however it isnt working. Please see the image I attached with the question for data values which are returned although they follow the rule. My Dataset is called readFile: I tried subset(readFile, !grepl('\\d{14}\\.\\d{2}$', readFile[[5]])) OR subset(readFile, !grepl('^\\d{14}\\.\\d{2}$', readFile[[5]])) – Deep May 25 '20 at 06:15
  • A little modification to your answers helped me:: subset(readFile, !grepl('\\.\\d{0,2}$', readFile[[5]]) | !grepl('^\\d{,14}\\.', readFile[[5]])) , where readFile is my Dataframe and column#5 where I wanted to apply the logic. Accepting your answer, the only issue is that though my file doesnt have decimals more than 3 places , I see the dataframe read from it has more decimals ( sometimes > 5). Maybe something with RStudio display options. readFile <- read.table(file=myFileName,header = TRUE, sep = "|", fill=TRUE, na.strings = c("","NA"), as.is = TRUE, numerals = "no.loss") – Deep May 25 '20 at 06:47
  • You can control the number of decimal to be displayed by using `options(digits = 10)` or any higher number. – Ronak Shah May 25 '20 at 07:53