3

I'm merging two data.frames, dat1 and dat2, by temp and the merge is not providing all values for dat2. Why are values from dat2 not merging correctly?

Sample data

dat1 <- data.frame(temp = seq(0, 33.2, 0.1))

dat2 <- structure(list(temp = c(6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7, 
7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8, 8.1, 8.2, 8.3, 
8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 9, 9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 
9.7, 9.8, 9.9, 10, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 
10.8, 10.9, 11, 11.1, 11.2, 11.3, 11.4, 11.5, 11.6, 11.7, 11.8, 
11.9, 12, 12.1, 12.2, 12.3, 12.4, 12.5, 12.6, 12.7, 12.8, 12.9, 
13, 13.1, 13.2), pprox = c(193.53, 626.8, 1055.04, 1478.24, 
1896.41, 2309.55, 2717.64, 3120.69, 3518.7, 3911.66, 4299.58, 
4682.45, 5060.26, 5433.03, 5800.74, 6163.39, 6520.99, 6873.53, 
7221.01, 7563.43, 7900.78, 8233.07, 8560.3, 8882.46, 9199.56, 
9511.59, 9818.55, 10120.44, 10417.27, 10709.03, 10995.71, 11277.33, 
11553.88, 11825.36, 12091.78, 12353.13, 12609.41, 12860.63, 13106.78, 
13347.87, 13583.89, 13814.86, 14040.76, 14261.61, 14477.41, 14688.14, 
14893.83, 15094.47, 15290.05, 15480.59, 15666.09, 15846.55, 16021.96, 
16192.34, 16357.68, 16517.98, 16673.26, 16823.51, 16968.73, 17108.93, 
17244.1, 17374.25, 17499.38, 17619.5, 17734.6, 17844.68, 17949.76, 
18049.82, 18144.87, 18234.91)), row.names = c(NA, 70L), class = "data.frame")

Merge

dat <- left_join(dat1, dat2, by = "temp")

Output

dat[65:70, ]

   temp approx
65  6.4      626.80
66  6.5     1055.04
67  6.6          NA
68  6.7     1896.41
69  6.8          NA
70  6.9     2717.64
Vedda
  • 7,066
  • 6
  • 42
  • 77
  • 1
    In your case, you should use `right_join()` or `full_join()` – yusuzech Jun 13 '18 at 00:24
  • @YifuYan Why? All `temp` values are available in `dat1` that are in `dat2`, so why does it not merge correctly? – Vedda Jun 13 '18 at 00:26
  • @Vedda There could be some `temp` values that are in `dat1` but not in `dat2` or the temp values may be mispelled, hence you can encounter a problem joining them successfully. – DTYK Jun 13 '18 at 00:29
  • @DTYK That is not the case in this example. – Vedda Jun 13 '18 at 00:31

2 Answers2

3

I converted the temp columns in both data frames to a factor, followed by left joining them together. It works!

dat1$temp <- as.factor(dat1$temp)
dat2$temp <- as.factor(dat2$temp)

dat <- left_join(dat1, dat2, by = "temp")
DTYK
  • 1,098
  • 1
  • 8
  • 33
  • Thanks, but that doesn't make any sense to me. Any idea why? – Vedda Jun 13 '18 at 00:41
  • 1
    My (unconfirmed) hunch is that the identifiers used for joining tables together have to be either a factor or a character data type. – DTYK Jun 13 '18 at 00:43
  • 1
    @DTYK - my guess would be that you are running into the classic https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal problem when trying to join on floating point numbers. – thelatemail Jun 13 '18 at 00:46
  • 2
    Yes, you are correct. I double checked the help file and by . `by: a character vector of variables to join by`. Thank you! – Vedda Jun 13 '18 at 00:47
  • 1
    Vedda are you saying DTYK is correct? his solution works but it is not because you can't join on numeric values, I believe thelatemail and the answer above are correct. The help file is saying you must pass a character vector (i.e. "test" or c("test")) not that the column must be contain character vectors. – AidanGawronski Jun 13 '18 at 00:52
3

Hmm interestingly identical(dat2$temp[4],6.6 ) returns TRUE but identical(dat1$temp[67],6.6) returns FALSE.

Floating point issues are a known issues have a look at Why are these numbers not equal? or floating point issue in R? among many other similar posts.

If you set dat1 <- data.frame(temp = round(seq(0, 33.2, 0.1), 2)) it should fix this. Possibly check out ?all.equal as all.equal(dat1$temp[67],6.6 ) is TRUE

Sarah
  • 3,022
  • 1
  • 19
  • 40
  • Thanks! It appears floating points can not merge on using `by`. It must be a character. – Vedda Jun 13 '18 at 00:51
  • 1
    @Vedda - Integers will work fine I believe. Since a `factor` is just an `integer` with a label around it, your solution to turn it into a `factor` just contradicted your point. Floating point accuracy will always be an issue, but merge operations will still work even with numeric data `merge(data.frame(id=0.54,val1=1), data.frame(id=0.54,val2=2))` – thelatemail Jun 13 '18 at 00:56
  • 1
    @Vedda I don't think that's the case. `by` being a "character vector of variables to join by" just means that the column names given have to be a character vector, not columns of characters. In fact, if you first covert both columns to characters then you won't have this problem as `identical(as.character(dat1$temp[67]),as.character(dat2$temp[4]))` is `TRUE`. So the problem with joining is with the inexact matched when both values are floating point – Sarah Jun 13 '18 at 01:02