
I am trying to push small changes to existing sparklyr code; the changes are meant to give the same results, only the code is supposed to be more readable and efficient. Therefore, I want to make sure I get the same results, which I have stored in a Hive table. To do so, I compare my new results to my old results using anti_join:

diff.sdf <- clean_results.sdf %>%
  anti_join(new_results.sdf, by = colnames(clean_results.sdf))

I don't get a 100% match, and after looking at the details, I suspect anti_join is not working as it should when it comes to doubles. It seems that it may treat values as different when they are in fact equal.
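A minimal base-R illustration of the kind of double mismatch I suspect (not the sparklyr setup itself, just the floating-point behavior):

```r
# Two doubles that print identically can still differ in low-order bits,
# so an exact-equality comparison (which anti_join relies on) fails.
x <- 0.1 + 0.2
y <- 0.3
print(x)                   # [1] 0.3
print(y)                   # [1] 0.3
print(x == y)              # [1] FALSE
print(abs(x - y) < 1e-9)   # [1] TRUE: a tolerance-based check succeeds
```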

Reproducible example (though it might be that going from Spark to R and back to Spark changes the situation):

structure(list(mnt_tot = 37008.16, date_analyse = "2019-01-31"), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("mnt_tot", 
"date_analyse"))

structure(list(mnt_tot = 37008.16, date_analyse = "2019-01-31"), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("mnt_tot", 
"date_analyse"))
  • I suspect it has to do with floating point notation being different in R and in Spark . . . – Ben G Apr 15 '19 at 18:16
  • . . . You could create a second column where you round the value to the decimal point you care about and then do your anti_join. – Ben G Apr 15 '19 at 18:35
  • @BenG I guess that's a solution; trying to see if there is anything better and more generic (since my columns are not always the same). – Vincent Apr 15 '19 at 18:54
  • Probably a dupe of [Why are these numbers not equal](https://stackoverflow.com/q/9508518/903061). Comparing doubles is not recommended as reliable within a single language. Add a conversion to another language in there and all bets are off. – Gregor Thomas Apr 15 '19 at 18:58
  • (And, your example doesn't *reproduce the problem*. Calling your first `dput` `df1` and the second `df2`, `anti_join(df1, df2)` works just fine.) – Gregor Thomas Apr 15 '19 at 19:01
  • @Gregor thanks for enlightening me; how would you do it then? What's the best practice in that case? Rounding? – Vincent Apr 15 '19 at 21:45
  • Probably do a regular join (not an anti-join) on integer/non-numeric column and calculate the absolute difference. Make sure all the differences are below some tolerance. That suggestion and more are at the duplicate. – Gregor Thomas Apr 15 '19 at 22:34
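A sketch of the two suggestions from the comments, on local dplyr data frames (the `id` key column and the `tol` tolerance value are assumptions; pick ones that suit your data):

```r
library(dplyr)

# Toy old/new results where one double differs only in representation.
old <- tibble::tibble(id = 1:2, mnt_tot = c(37008.16, 100 * (0.1 + 0.2)))
new <- tibble::tibble(id = 1:2, mnt_tot = c(37008.16, 100 * 0.3))

# Suggestion 1 (rounding): round numeric columns to the precision you
# care about before the anti_join, so representation noise disappears.
old_r <- old %>% mutate(across(where(is.numeric), ~ round(.x, 2)))
new_r <- new %>% mutate(across(where(is.numeric), ~ round(.x, 2)))
diff_rounded <- anti_join(old_r, new_r, by = colnames(old_r))

# Suggestion 2 (tolerance): regular join on the stable key columns only,
# then flag rows whose absolute numeric difference exceeds a tolerance.
tol <- 1e-6  # assumption: tolerance appropriate for the data
diff_tol <- old %>%
  inner_join(new, by = "id", suffix = c("_old", "_new")) %>%
  filter(abs(mnt_tot_old - mnt_tot_new) > tol)
```

Both `diff_rounded` and `diff_tol` come back empty here, even though an exact anti_join on the raw doubles would report the second row as different.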

0 Answers