
In a project where we are computing metrics of graph objects across different edge densities, we have been using a subset call to pull out rows at a specific density. At the moment, density is stored as a numeric field in a data.frame, and the subset also specifies the criterion as a number. This leads to a situation where some subset calls work as expected and others do not. I believe this relates to machine precision on floating point values and we have worked around it by encoding density as a factor, but I wondered if there was a more intelligent way to think about the problem or to understand why R behaves in the way it does. More specifically, I would like to avoid such problems in the future and wonder whether using a factor is the best option, or if there is something more R-thonic.

Thanks for your input!

df <- data.frame(degree=rnorm(20), density=seq(0.01, 0.20, .01))

#only .05, .08, and .09 generate output
for (d in c(.05, .06, .07, .08, .09, .10)) { print(subset(df, density==d)) }

#this works as expected
for (d in seq(.01, .20, .01)) { print(subset(df, density==d)) }

#here is some evidence that machine precision may be to blame
diff(diff(seq(0.01, 0.20, .01)))
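A pattern that avoids both exact equality and the factor workaround is to compare within a small tolerance, which is what `all.equal()` does internally. A minimal sketch (the tolerance `sqrt(.Machine$double.eps)` is the default used by `all.equal()`):

```r
# Compare within a tolerance instead of testing exact equality,
# so values that differ only by floating point rounding still match.
df <- data.frame(degree = rnorm(20), density = seq(0.01, 0.20, 0.01))
tol <- sqrt(.Machine$double.eps)

for (d in c(.05, .06, .07, .08, .09, .10)) {
  # abs(density - d) < tol matches the row whose density is "close enough"
  print(subset(df, abs(density - d) < tol))
}
```

With this version every iteration prints exactly one row, regardless of how the literals and the `seq()` values were rounded.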
  • Yeah, testing equality of floating point numbers pretty reliably leads to trouble like this. – Frank May 17 '17 at 21:15
  • The problem is indeed machine precision; e.g. try `c(.05, .06, .07, .08) %in% seq(0.01, 0.20, .01)`. One way to fix it besides using a factor is to use a group by approach (such as `group_by()` and `summarize()` from dplyr, or data.table) when dividing a df into a sequence of sub-dfs based on one column. Besides avoiding floating point error, this will usually be faster and more expressive. – David Robinson May 17 '17 at 21:15
  • Aside from the comprehensive discussion in the linked duplicate, you are right that you could always keep a non-numeric version of the variable in question and do your subsetting using that. – joran May 17 '17 at 21:16
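The group-by approach mentioned in the comments can also be done in base R with `split()`, which partitions the data.frame by the distinct values actually present in the column, so no equality test against a re-typed numeric literal is ever performed. A sketch (using `split()` rather than dplyr, purely for illustration):

```r
# split() partitions df into one sub-data.frame per distinct
# density value, sidestepping floating point comparisons entirely.
df <- data.frame(degree = rnorm(20), density = seq(0.01, 0.20, 0.01))
sub_dfs <- split(df, df$density)

length(sub_dfs)  # one sub-df per distinct density (20 here)
```

Iterating over `sub_dfs` with `lapply()` then replaces the loop of `subset()` calls.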

0 Answers