
Does anyone know how Excel stores decimal values, and why values saved from R are slightly different when loaded back into R? It seems that Excel can store up to 15 significant digits; what about R?

I have a dataset with many values that display 6 decimal places in R, and I'm using them for an analysis in lme4. But I noticed that on the same dataset (saved in two different files) the models sometimes converge and sometimes don't. I was able to narrow the problem down to the way Excel changes the values, but I'm not sure what to do about it.

I have a dataframe like this:

head(Experiment1)
       Response           First.Key     logResponseTime 
1         TREE             2345           3.370143 
2         APPLE             927           2.967080 
3         CHILD             343           2.535294       
4         CAT               403           2.605305       
5         ANGEL             692           2.840106       
6         WINGS             459           2.661813 

logResponseTime was obtained by taking log10 of First.Key
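The computation can be reconstructed from the rows shown above (a minimal sketch; the values are the ones in the question):

```r
# Rebuild the first six rows and compute the log response time
Experiment1 <- data.frame(
  Response  = c("TREE", "APPLE", "CHILD", "CAT", "ANGEL", "WINGS"),
  First.Key = c(2345, 927, 343, 403, 692, 459)
)
Experiment1$logResponseTime <- log10(Experiment1$First.Key)

round(Experiment1$logResponseTime, 6)
# 3.370143 2.967080 2.535294 2.605305 2.840106 2.661813
```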

I then save this to a csv file, and load the df again, and get

head(Experiment2)
       Response           First.Key     logResponseTime 
1         TREE             2345           3.370143 
2         APPLE             927           2.967080 
3         CHILD             343           2.535294       
4         CAT               403           2.605305       
5         ANGEL             692           2.840106       
6         WINGS             459           2.661813

exactly the same values, to the 6 decimal places displayed
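The round-trip can be reproduced without Excel at all; write.csv (which formats doubles with roughly 15 significant digits, not the 17 needed to round-trip a double exactly) is enough to introduce the drift. A minimal sketch:

```r
# Original full-precision values
x <- log10(c(2345, 927, 343, 403, 692, 459))

# Write to CSV and read back
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(logResponseTime = x), tmp, row.names = FALSE)
y <- read.csv(tmp)$logResponseTime

x - y            # differences on the order of 1e-15 (some may be exactly 0)
all.equal(x, y)  # TRUE: equal within the default numerical tolerance
```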

but then this happens


 Experiment1$logResponseTime - Experiment2$logResponseTime
1                                             2.220446e-15
2                                            -2.664535e-15
3                                             4.440892e-16
4                                            -4.440892e-16
5                                            -2.220446e-15
6                                             8.881784e-16

These differences are tiny, but they make the difference between convergence and non-convergence in my lmer models (where logResponseTime is the DV), which is why I'm concerned.

Is there a way to save R data frames to Excel in a format that won't introduce these changes (I use write.csv)? And more importantly, why do such tiny differences matter to lmer?

Agata

1 Answer


These tiny bits of rounding are hard to avoid; most of the time, it's not worth trying to fix them (in general errors of this magnitude are ubiquitous in any computer system that uses floating-point values).

It's hard to say exactly what the differences are between the analyses with the rounded and unrounded numbers, but you should be aware that the diagnosis of a convergence problem is based on particular numerical thresholds for the magnitude of the gradient at the maximum likelihood estimate and other related quantities. Suppose the threshold is 0.002 and that running your model with unrounded values results in a gradient of 0.0019, while running it with the rounded values results in a gradient of 0.0021. Then your model will "converge" in one case and "fail to converge" in the other case. I can appreciate the potential inconvenience of getting slightly different values just by saving your data to a CSV (or XLSX) file and restoring them from there, but you should also be aware that even running the same models on a different operating system could produce equally large differences. My suggestions:

  • check to see how big the important differences are between the rounded/unrounded results ("important differences" are differences in estimates you care about for your analysis, of magnitudes that are large enough to change your conclusions)
  • if these are all small, you can increase the tolerance of the convergence checks slightly so they don't bother you, e.g. use control = lmerControl(check.conv.grad = .makeCC("warning", tol = 6e-3, relTol = NULL)) (the default tolerance is 2e-3, see ?lmerControl)
  • if these are large, that should concern you - it means your model fit is very unstable. You should probably also try running allFit() to see how big the differences are when you use different optimizers.
  • you might be able to use the methods described here to make your read/write flow a little more precise.
  • if possible, you could save your data to a .rds or .rda file rather than CSV, which will keep the full precision.
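A sketch of that last, lossless route (the serialized object comes back bit-for-bit identical, so no rounding ever occurs):

```r
# A stand-in data frame with full-precision doubles
Experiment1 <- data.frame(
  First.Key       = c(2345, 927, 343),
  logResponseTime = log10(c(2345, 927, 343))
)

# Round-trip through .rds instead of CSV
tmp <- tempfile(fileext = ".rds")
saveRDS(Experiment1, tmp)
Experiment2 <- readRDS(tmp)

identical(Experiment1$logResponseTime, Experiment2$logResponseTime)  # TRUE
```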
Ben Bolker
  • Thanks, this is actually reassuring! The differences in results are tiny and don't vary enough to change the interpretation of the particular model, it's just that I'm not sure I can report a non-converging model, so then I have to take away (or add) a parameter, let's say a random slope, and that affects the presence of the slope in subsequent analyses with extra predictors, and that probably does make a difference. – Agata Jul 12 '22 at 09:02
  • by the way, how would I go about increasing tolerance of convergence checks? – Agata Jul 12 '22 at 09:27