Sample of the dataset (the full sample can be downloaded via this link):

Date/Time,Hs,Hmax,Tp,Tz,Peak Direction,SST
1/01/2018 0:00,-99.9,-99.9,-99.9,-99.9,-99.9,-99.9
1/01/2018 0:30,0.513,0.81,10.315,4.748,-99.9,-99.9
1/01/2018 1:00,0.566,0.93,10.778,5.003,92,26.4
1/01/2018 1:30,0.557,0.85,9.984,4.99,91,26.4

I read it in as follows; all columns except Date.Time are numeric.

maloolaba.waves <- read.csv(file = "./data/mooloolaba_2018-01-01t00_00-2018-10-31t23_30.csv", header = TRUE)

Code to remove rows containing -99.9:

maloo.RM.outlier <- maloolaba.waves[!(apply(maloolaba.waves, 1, 
                             function(y) any(y == -99.9) )),]

When I run summary() after removing the -99.9 values, I get this:

summary(maloo.RM.outlier)

          Date.Time           Hs               Hmax        
 1/01/2018 1:00 :    1   Min.   :-99.900   Min.   :-99.900  
 1/01/2018 1:30 :    1   1st Qu.:  0.805   1st Qu.:  1.350  
 1/01/2018 10:00:    1   Median :  1.112   Median :  1.870  
 1/01/2018 10:30:    1   Mean   :  1.234   Mean   :  2.089  
 1/01/2018 11:00:    1   3rd Qu.:  1.608   3rd Qu.:  2.700  
 1/01/2018 11:30:    1   Max.   :  4.257   Max.   :  7.262  
 (Other)        :14543                                      
       Tp                Tz          Peak.Direction      SST       
 Min.   :-99.900   Min.   :-99.900   Min.   :  5    Min.   :19.80  
 1st Qu.:  7.529   1st Qu.:  5.035   1st Qu.: 91    1st Qu.:21.00  
 Median :  9.146   Median :  5.568   Median :105    Median :23.00  
 Mean   :  9.245   Mean   :  5.679   Mean   :103    Mean   :23.43  
 3rd Qu.: 10.903   3rd Qu.:  6.257   3rd Qu.:119    3rd Qu.:26.00  
 Max.   : 21.121   Max.   : 10.146   Max.   :358    Max.   :28.65 

Yet when I look through the maloo.RM.outlier dataset, I can't see any -99.9 values, so I searched for them:

which(maloo.RM.outlier$Hs == -99.9, arr.ind = T)

[1] 11501 13775

I have looked at rows 11501 and 13775, and there are no -99.9 values there. I have also tried clearing the global environment and restarting the R session, but nothing fully gets rid of the value -99.9, and the summary still reports a minimum of -99.9. Does anyone know how to remove floating-point values?

tcratius
  • Possible duplicate of: https://stackoverflow.com/q/9508518/1222578 – Marius Apr 24 '19 at 05:27
  • OK, floating-point values, that makes sense. Can you please show me how to isolate the rows? For instance, `sum(maloo.RM.outlier$Hs < 0)` returns 2, yet I cannot find these two negative values. – tcratius Apr 24 '19 at 06:43
  • `any(abs(y + 99.9) < 1e-9)` should find them. – r2evans Apr 24 '19 at 06:45
  • @tcratius I like `near` from the `dplyr` package, instead of `any(y == -99.9)` you could do `dplyr::near(y, -99.9)`. – Marius Apr 24 '19 at 06:47
  • @r2evans and Marius, please post these as an answer, as they solved my problem. r2evans, your suggestion worked better with the apply function, though I can see the value in Marius' suggestion too. – tcratius Apr 24 '19 at 07:06
  • What does this have to do with `Rmarkdown`? – asachet Apr 24 '19 at 08:26

1 Answer

As R FAQ 7.31 explains, you can't reliably test floating-point numbers for exact equality; you can only test whether they are approximately equal. There are several ways to do this, but a popular one (and my favorite) is to subtract the target value and check that the difference falls below a small threshold.
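To make the FAQ 7.31 point concrete, here is a minimal sketch using the textbook 0.1 + 0.2 case rather than the wave data. The variable names are illustrative only, and 1e-9 is just one plausible threshold.

```r
# 0.1 + 0.2 is not stored as exactly 0.3 in double precision.
x <- 0.1 + 0.2

# Exact comparison fails on IEEE 754 doubles.
exact_match <- x == 0.3

# Subtract the target and compare the absolute difference to a small threshold.
approx_match <- abs(x - 0.3) < 1e-9

print(c(exact = exact_match, approx = approx_match))
```

The exact test fails even though the two numbers print identically at default precision, while the threshold test succeeds.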

Because the actual value might be on either side (pos/neg) of my comparison value, we can use the absolute value to take that into account. The resulting code changes your

any(y == -99.9)

to

any( abs(y + 99.9) < 1e-9 )

Which, coincidentally, is precisely what Marius' suggested function (dplyr::near) is doing:

dplyr::near
# function (x, y, tol = .Machine$double.eps^0.5) 
# {
#     abs(x - y) < tol
# }
# <bytecode: 0x000000002506d7b8>
# <environment: namespace:dplyr>

though it chooses its tolerance in a slightly more principled way: instead of a hand-picked constant, the default is .Machine$double.eps^0.5, the square root of the machine epsilon (about 1.5e-8 for standard doubles).
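For readers who don't have dplyr installed, the same check can be sketched in base R. `near_base` is a hypothetical name; its body simply mirrors the printed source of `dplyr::near`, and the `hs` vector reuses a few Hs values from the sample rows in the question.

```r
# Base-R stand-in that mirrors the printed body of dplyr::near.
# near_base is a hypothetical helper name, not part of any package.
near_base <- function(x, y, tol = .Machine$double.eps^0.5) {
  abs(x - y) < tol
}

# Default tolerance: square root of the machine epsilon.
default_tol <- .Machine$double.eps^0.5

# A few Hs values taken from the sample rows in the question.
hs <- c(-99.9, 0.513, 0.566)

# TRUE where the value is (approximately) the -99.9 sentinel.
is_sentinel <- near_base(hs, -99.9)
print(is_sentinel)
```

Because the function is vectorized, it still needs wrapping in `any()` when used inside a row-wise `apply()` call.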

I chose 1e-9 for code-golf, though if you are programming something, you should probably name it something meaningful so it isn't a "magic constant". Perhaps tol <- 1e-9 or eps <- 1e-9 (for epsilon, a variable frequently used to indicate an arbitrarily small, positive number).
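Putting the pieces together, a sketch of the row filter with named constants might look like the following. The toy data frame is built from the four sample rows in the question, and `missing_code`, `tol`, `waves`, and `waves_clean` are names chosen for illustration. One assumption worth stating: `apply()` converts a data frame with `as.matrix()`, which produces a character matrix when a text column such as Date.Time is included, so the sketch restricts the check to the numeric columns before doing arithmetic.

```r
# Named constants instead of magic numbers (names are illustrative).
missing_code <- -99.9
tol <- 1e-9

# Toy data frame built from the sample rows shown in the question.
waves <- data.frame(
  Date.Time = c("1/01/2018 0:00", "1/01/2018 0:30",
                "1/01/2018 1:00", "1/01/2018 1:30"),
  Hs = c(-99.9, 0.513, 0.566, 0.557),
  Hmax = c(-99.9, 0.81, 0.93, 0.85),
  Tp = c(-99.9, 10.315, 10.778, 9.984),
  Tz = c(-99.9, 4.748, 5.003, 4.99),
  Peak.Direction = c(-99.9, -99.9, 92, 91),
  SST = c(-99.9, -99.9, 26.4, 26.4),
  stringsAsFactors = FALSE
)

# Only check numeric columns so apply() sees a numeric matrix.
num_cols <- sapply(waves, is.numeric)

# Flag rows where any numeric value is within tol of the sentinel.
has_sentinel <- apply(
  waves[, num_cols, drop = FALSE], 1,
  function(y) any(abs(y - missing_code) < tol)
)

# Keep only the rows without a sentinel value.
waves_clean <- waves[!has_sentinel, ]
print(summary(waves_clean$Hs))
```

Selecting columns with `is.numeric` rather than by position means the filter keeps working if columns are reordered or added.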

r2evans
  • Hi @r2evans, if you would like the full Mooloolaba sample dataset, you can [download it here](https://www.tworiel.cc/post/2019-04-21-data-cleansing/data-cleansing/). I personally found that `maloo.RM.outlier <- maloolaba.waves[!(apply(maloolaba.waves[,2:6], 1, function(y) any(abs(y + 99.9) < 1e-9))),]` worked best. – tcratius Apr 25 '19 at 02:31