
I have a strange problem. When I do a left_join from dplyr between two data frames, say table_a and table_b, which have the column C in common, I get lots of NAs. The only rows that match are those where C is zero in both tables, even though many other values of C appear in both.

One thing I did notice is that the C column in table_b, on which I would like to match, displays 0 as 0.0, whereas in table_a 0 is displayed simply as 0.

A sample is here

head(table_a) gives

  likelihood_ols LR_statistic_ols decision_ols   C
1       -1.51591          0.20246            0 -10
2       -1.51591          0.07724            0  -9
3       -1.51591          0.00918            0  -8
4       -1.51591          0.00924            0  -7
5       -1.51591          0.08834            0  -6
6       -1.51591          0.25694            0  -5

and the other one is here

head(table_b)

  quantile    C pctile
1  2.96406  0.0     90
2  4.12252  0.0     95
3  6.90776  0.0     99
4  2.78129 -1.8     90
5  3.92385 -1.8     95
6  6.77284 -1.8     99

Now, there are definitely overlaps between the C columns but only the zeroes are found, which is confusing.

When I take the sorted unique values of the C columns with a <- sort(unique(table_a$C)) and b <- sort(unique(table_b$C)), I get the following output:

> a[2]
[1] -9
> b[56]
[1] -9
> a[2]==b[56]
[1] FALSE

What is going on here? I am reading in the values using read.csv. The CSVs were generated on two different systems, one on CentOS and one on RedHat/Fedora, if that plays a role at all. I have tried forcing the data frames to be tibbles and converting the values to character and then back to numeric. I have also checked all of R's classes and the types discussed here, but to no avail: they all match.

What else could make them different, and how do I tell R that they are equal so I can run my merge function?

  • Would floating point arithmetic be a concern here? If so, how could I build a workaround? – Hirek Apr 19 '19 at 23:21

1 Answer


Just because two floating point numbers print out the same doesn't mean they are identical.
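As a rough illustration of that point, here is a minimal sketch with made-up values (x and y are hypothetical, not taken from the question's data). They print identically under R's default of 7 significant digits, yet an exact comparison fails:

```r
# Hypothetical values for illustration only; not the asker's data.
x <- -9
y <- -9 + 1e-12            # differs from x far below the default print precision

print(x)                   # default printing uses 7 significant digits
print(y)                   # so this looks the same as x on screen

x == y                     # exact comparison: the two doubles are not identical
x - y                      # a tiny, non-zero difference
format(y, digits = 17)     # asking for more digits exposes the mismatch
isTRUE(all.equal(x, y))    # comparison within a tolerance instead of exact equality
```

Running the same checks on a[2] and b[56] from the question would probably show whether this is what is happening there.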

A simple enough solution is to round, e.g.:

table_a$new_a_likelihood_ols <- signif(table_a$likelihood_ols, 6)
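Since the join in the question is on C, the same idea would presumably be applied to the key column in both tables before joining. The sketch below uses small invented stand-ins for table_a and table_b, and the choice of 6 significant digits is an assumption that should be adapted to the precision the data really has:

```r
library(dplyr)

# Toy stand-ins for table_a and table_b; all values are invented for illustration.
table_a <- data.frame(LR_statistic_ols = c(0.20246, 0.07724, 0.00918),
                      C = c(-10, -9, -8))
table_b <- data.frame(quantile = c(2.96406, 2.78129),
                      C = c(-9 + 1e-12, -8 - 1e-12),  # keys carrying tiny floating point noise
                      pctile = c(90, 90))

# Joining on the raw doubles: the noisy keys do not match exactly
unrounded <- left_join(table_a, table_b, by = "C")

# Round the key in BOTH tables to the same number of significant digits, then join
table_a$C <- signif(table_a$C, 6)
table_b$C <- signif(table_b$C, 6)
joined <- left_join(table_a, table_b, by = "C")

unrounded
joined
```

Rounding treats any two keys that agree to the chosen precision as equal, so the number of digits should be small enough to absorb the noise but large enough to keep genuinely different values of C apart.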