0

I have two data frames I want to merge based on a numeric value, however I am having trouble with floating point accuracy. Example:

> df1 <- data.frame(number = 0.1 + seq(0.01,0.1,0.01), letters = letters[1:10])
> df2 <- data.frame(number = seq(0.11,0.2,0.01), LETTERS = LETTERS[1:10])
> (merged <- merge(df1, df2, by = "number", all = TRUE))
   number letters LETTERS
1    0.11       a       A
2    0.12    <NA>       B
3    0.12       b    <NA>
4    0.13       c       C
5    0.14       d       D
6    0.15    <NA>       E
7    0.15       e    <NA>
8    0.16       f       F
9    0.17       g       G
10   0.18       h       H
11   0.19       i       I
12   0.20       j       J

Some of the values (0.12 and 0.15) don't match up due to floating point accuracy issues as discussed in this post. The solution for finding equality there was the use of the all.equal function to remove floating point artifacts, however I don't believe there is a way to do this within the merge function.

Currently I get around it by forcing one of the the number columns to a character and then back to a number after merge, but this is a little clunky; does anyone have a better solution for this problem?

> df1c <- df1
> df1c[["number"]] <- as.character(df1c[["number"]])
> merged2 <- merge(df1c, df2, by = "number", all = TRUE)
> merged2[["number"]] <- as.numeric(merged2[["number"]])
> merged2
   number letters LETTERS
1    0.11       a       A
2    0.12       b       B
3    0.13       c       C
4    0.14       d       D
5    0.15       e       E
6    0.16       f       F
7    0.17       g       G
8    0.18       h       H
9    0.19       i       I
10   0.20       j       J

EDIT: A little more about the data

  • I wanted to keep my question general to make it more applicable to other people's problems, but it seems I may need to be more specific to get an answer.
  • It is likely that all of the issues with merging with be due to floating point inaccuracy, but it may be a little hard to be sure. The data comes in as a series of time series values, a start time, and a frequency. These are then turned into a time series (ts) object and a number of functions are called to extract features from the time series (one of which is the time value), which is returned as a data frame. Meanwhile another set of functions is being called to get other features from the time series as targets. There are also potentially other series getting features generated to complement the original series. These values then have to be reunited using the time value.
  • Can't store as POSIXct: Each of these processes (feature extraction, target computation, merging) has to be able to occur independently and be stored in a CSV type format so it can be passed to other platforms. Storing as a POSIXct value would be difficult since the series aren't necessarily stored in calendar times.
smci
  • 32,567
  • 20
  • 113
  • 146
Barker
  • 2,074
  • 2
  • 17
  • 31
  • 2
    This is why you never use a value as the key... see if you can change your workflow... but in the meantime, have you tried using the join family from dplyr? for example, `dplyr::full_join(df1c,df2,by="number")` or some such – Amit Kohli Oct 21 '16 at 18:25
  • The numeric values I am merging come from the `time` values of a time series and converting those to a character or factor from a numeric has its own difficulties. Also, `dplyr::full_join()` doesn't work either. You ran it with `df1c` where I had converted `number` to a character rather than with `df1`. – Barker Oct 21 '16 at 18:31
  • why not join directly on the date/time field w/out converting to numeric? – Amit Kohli Oct 21 '16 at 18:33
  • The data has to be passed from one place to another and that is not a supported format for passing data in this environment. – Barker Oct 21 '16 at 18:42
  • 1
    Just for the above example , `cbind( df1, df2[match( round(df1$number,2),round(df2$number,2)),-1] )` would work. Though it will be equivalent to merge with all.x=TRUE whwn df1 has more rows than df2 – R.S. Oct 21 '16 at 18:53
  • Thanks R.S., I wouldn't have though to use `match` like that. I am a little concerned with having to choose the accuracy of the round though since this needs to work with arbitrary time series seasonality. I know when I tried rounding for equality before, if I used a big number like 20 sig figs, sometimes there would be no match, but when I used a smaller number like 5, it failed when we wanted to work with series of seconds/year. Do you have a good metric for choosing the number of sig figs? I think `all.equal` uses information about how the numbers are encoded. – Barker Oct 21 '16 at 19:07
  • The root-cause is that `merge.data.frame` doesn't expose a `tolerance` argument unlike [`all.equal`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/all.equal). I think the solution here is a hack, and rounding timestamps may be destructive, and cause inexact compares/merges elsewhere, i.e. just push the problem around. It would help if you say more about your input format. – smci Jul 30 '19 at 06:42
  • @smci I posted this problem almost 3 years ago. I no longer have access to the original data or a 100% clear memory of everything I was doing at the time. If there is something specific about the data you would like to know, I can try to remember if it is something you think will help future users referencing this post. Otherwise, I don't know what to tell you. – Barker Jul 30 '19 at 17:52
  • @Barker: that's fine, you can generate random-seeded (numeric) data that proves the point, and edit the question. – smci Jul 30 '19 at 18:49
  • @smci I am telling you I don't remember all the details about this question any more. If you have a specific question, I will see if I can remember. If not I have already provided a repro in this question. Personally, I no longer need an answer to this question. If more info would help others I will do my best to help. – Barker Jul 30 '19 at 21:08

1 Answers1

1

Round to the level of precision that will allow the number to be equal.

> df1$number=round(df1$number,2)
> df2$number=round(df2$number,2)
> 
>  (merged <- merge(df1, df2, by = "number", all = TRUE))
   number letters LETTERS
1    0.11       a       A
2    0.12       b       B
3    0.13       c       C
4    0.14       d       D
5    0.15       e       E
6    0.16       f       F
7    0.17       g       G
8    0.18       h       H
9    0.19       i       I
10   0.20       j       J

If you need to choose the level of precision programmatically then you should tell us more about the data and whether we can perhaps assume that it will always be due to floating point inaccuracy. If so, then rounding to 10 decimal places should be fine. The all.equal function uses sqrt(.Machine$double.eps) which in usually practice should be similar to round( ..., 16).

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • I have the same issue but round is not working, I had to use signif, why? – Herman Toothrot Jul 18 '18 at 09:00
  • "Not working" is generally unsuccessful as a problem description, but depending on your goals and the problem at hand, you might one of choose `round`, `signif` or `sprintf` – IRTFM Jul 18 '18 at 15:33