Merge skips random rows when combining two dataframes in R

Question

A given species is found across a range of temperatures that is normally distributed about a mean of 15º. I'm trying to create a dataframe of the proportion of that species by temperature, where the range of all temperatures (0º-30º) is wider than the range of the species.

However, when I try to buffer the ends of the species distribution with 0's to get the full 0-30º range by merging the species' distribution data with a dataframe of all temperatures, the merge function seems to skip some of the values from one dataset when combining it with the other:

df <- as.data.frame(rnorm(100, mean=15, sd=2))
# Change column name
colnames(df) <- "x"
# Calculate density
d <- density(df$x,na.rm=T)
# Check it (looks like a normalish distribution)
plot(d)
# As points
plot(d$x, d$y)
# Convert those points to a dataframe
d1 <- cbind(as.data.frame(d$x), as.data.frame(d$y))
colnames(d1) <- c("x", "y")
# Round x to 0.1
d1$x <- (round(d1$x,1))
# Aggregate by x, calling the new columns temperature and proportion
d2 <- aggregate(list(proportion=d1$y), by=list(temperature=d1$x), FUN="mean")
# Round proportion to 0.001
d2$proportion <- round(d2$proportion, 3)
# Create a vector of temperatures from 0-30 in increments of 0.1
alltemps <- as.data.frame(seq(0,30, by=0.1))
# Change the column heading
colnames(alltemps) <- "temperature"
# Merge the two datasets by temperature
d3 <- merge(alltemps, d2, all.x=T)

From here, I would convert all NAs to 0. But as you can see, merge skips some of the values from d2, putting in NAs where there should be values from d2.

Starting at temperature = 7.2, d2 has a corresponding proportion for each 0.1º temperature increment:

> d2
    temperature proportion
1           7.2      0.000
2           7.3      0.000
3           7.4      0.000
4           7.5      0.000
5           7.6      0.000
6           7.7      0.001
7           7.8      0.001
8           7.9      0.001
9           8.0      0.001
10          8.1      0.002
11          8.2      0.002
12          8.3      0.003
13          8.4      0.003
14          8.5      0.004
15          8.6      0.005
16          8.7      0.006
17          8.8      0.008
18          8.9      0.009
19          9.0      0.010
20          9.1      0.011
21          9.2      0.013
22          9.3      0.014
...
140        21.1      0.004
141        21.2      0.003
142        21.3      0.003
143        21.4      0.002
144        21.5      0.002
145        21.6      0.001
146        21.7      0.001
147        21.8      0.001
148        21.9      0.000
149        22.0      0.000
150        22.1      0.000
151        22.2      0.000
152        22.3      0.000
153        22.4      0.000

alltemps has an increment of 0.1º from 0.0º to 30.0º:

> alltemps
    temperature
1           0.0
2           0.1
3           0.2
4           0.3
5           0.4
6           0.5
7           0.6
8           0.7
9           0.8
10          0.9
11          1.0
...    
69          6.8
70          6.9
71          7.0
72          7.1
73          7.2
74          7.3
75          7.4
76          7.5
...
221        22.0
222        22.1
223        22.2
224        22.3
225        22.4
226        22.5
227        22.6
228        22.7
229        22.8
...
296        29.5
297        29.6
298        29.7
299        29.8
300        29.9
301        30.0

But when you combine them, 'merge' skips some of the values that should be added from d2 (e.g. at 7.3, 7.6, 7.8, etc.):

> d3
    temperature proportion
1           0.0         NA
2           0.1         NA
3           0.2         NA
4           0.3         NA
5           0.4         NA
6           0.5         NA
7           0.6         NA
8           0.7         NA
9           0.8         NA
10          0.9         NA
11          1.0         NA
...
71          7.0         NA
72          7.1         NA
73          7.2      0.000
74          7.3         NA
75          7.4      0.000
76          7.5      0.000
77          7.6         NA
78          7.7      0.001
79          7.8         NA
80          7.9      0.001
81          8.0      0.001
82          8.1      0.002
...
151        15.0      0.186
152        15.1         NA
153        15.2         NA
154        15.3      0.183
155        15.4      0.181
156        15.5      0.178
157        15.6         NA
158        15.7         NA
159        15.8      0.168
160        15.9      0.164
161        16.0      0.159
162        16.1      0.154
163        16.2      0.149
164        16.3      0.144
165        16.4         NA
166        16.5      0.132
...

What's happening here? Is this because d1 is generated from a kernel density estimate rather than real numbers?

I don't see any `NA`s that shouldn't be there after running your code. Could you show what the output you're getting looks like? — ytk, Jun 10 '16 at 20:21
The behavior of `merge` seems appropriate to me. Perhaps it isn't the tool you're actually after. Also, `density()` won't give you a proportion; it gives you, well, a density. You can condense your code in places with the following scheme: `df <- data.frame(x=rnorm(100,15,2))`; no need to rename etc. Maybe I could be of more help if it was clearer what you were after. — rbatt, Jun 10 '16 at 20:26
If you're trying to match on decimals, then your question is really the ever-popular [Why are these numbers not equal?](http://stackoverflow.com/q/9508518/903061) — Gregor Thomas, Jun 10 '16 at 20:57
Good catch, I didn't look carefully enough and actually compare what values were associated with the NA's. Yeah, matching to numbers can be tricky. Check out `?is.integer`, and see the example that defines `is.wholenumber()`. — rbatt, Jun 10 '16 at 23:16

Hack-R · Accepted Answer · 2016-06-10T20:57:51.753

This recovers the roughly 70 observations that were previously skipped when matching alltemps against d2. There are still temperatures in alltemps that aren't present in d2, which is expected, but values in d2 that do have a match are no longer skipped.

The problem is floating-point representation: the temperatures look identical at the displayed precision, but the underlying doubles that merge compares are not exactly equal. I fixed it by rounding both temperature columns in the same way before the merge.
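A small check (not part of the original code; the name `grid` is just illustrative) may make the mismatch easier to see. As far as I understand it, `seq()` builds each value as a multiple of the step, so some entries differ in their last bits from what `round(x, 1)` returns, even though both print the same:

```r
# Grid built the same way as alltemps in the question
grid <- seq(0, 30, by = 0.1)

# Element 74 prints as 7.3 both before and after rounding...
print(grid[74])
print(round(grid[74], 1))

# ...but the stored doubles are not identical, so an exact match fails
grid[74] == round(grid[74], 1)                    # FALSE

# Printing more digits exposes the hidden difference
sprintf("%.17f", grid[74])
sprintf("%.17f", round(grid[74], 1))

# How many of the 301 grid values already equal their rounded form
sum(grid == round(grid, 1))

# all.equal() compares with a tolerance, so it treats the two as equal
isTRUE(all.equal(grid[74], round(grid[74], 1)))   # TRUE
```

This is why rounding only one side of the merge is not enough: both key columns need to go through the same `round()` call so that equal-looking values are stored identically.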

df <- as.data.frame(rnorm(100, mean=15, sd=2))
# Change column name
colnames(df) <- "x"
# Calculate density
d <- density(df$x,na.rm=T)
# Check it (looks like a normalish distribution)
plot(d)
# As points
plot(d$x, d$y)
# Convert those points to a dataframe
d1 <- cbind(as.data.frame(d$x), as.data.frame(d$y))
colnames(d1) <- c("x", "y")
# Round x to 0.1
d1$x <- (round(d1$x,1))
# Aggregate by x, calling the new columns temperature and proportion
d2 <- aggregate(list(proportion=d1$y), by=list(temperature=d1$x), FUN="mean")
# Round
d2$proportion <- round(d2$proportion, 3)
d2$temperature <- round(d2$temperature, 1)
# Create a vector of temperatures from 0-30 in increments of 0.1
alltemps <- as.data.frame(seq(0,30, by=0.1))
# Change the column heading
colnames(alltemps) <- "temperature"
alltemps$temperature <- round(alltemps$temperature, 1)
# Merge the two datasets by temperature
d3 <- merge(alltemps, d2) #add back all.x=T if you want it
d3

> d3
    temperature proportion
1           7.3      0.000
2           7.4      0.000
3           7.5      0.000
4           7.6      0.000
5           7.7      0.000
6           7.8      0.000
7           7.9      0.001
8           8.0      0.001
9           8.1      0.001
10          8.2      0.001
11          8.3      0.002
12          8.4      0.002
13          8.5      0.002
14          8.6      0.003
15          8.7      0.003
16          8.8      0.004
17          8.9      0.004
18          9.0      0.005
19          9.1      0.005
20          9.2      0.006
21          9.3      0.006
22          9.4      0.006
23          9.5      0.006
24          9.6      0.007
25          9.7      0.007
26          9.8      0.007
27          9.9      0.007
28         10.0      0.008
29         10.1      0.008
30         10.2      0.008
31         10.3      0.009
32         10.4      0.010
33         10.5      0.011
34         10.6      0.012
35         10.7      0.013
36         10.8      0.015
37         10.9      0.017
38         11.0      0.019
39         11.1      0.022
40         11.2      0.025
41         11.3      0.028
42         11.4      0.031
43         11.5      0.035
44         11.6      0.039
45         11.7      0.043
46         11.8      0.047
47         11.9      0.051
48         12.0      0.055
49         12.1      0.059
50         12.2      0.062
51         12.3      0.066
52         12.4      0.070
53         12.5      0.074
54         12.6      0.078
55         12.7      0.082
56         12.8      0.086
57         12.9      0.091
58         13.0      0.096
59         13.1      0.101
60         13.2      0.108
61         13.3      0.114
62         13.4      0.120
63         13.5      0.128
64         13.6      0.135
65         13.7      0.141
66         13.8      0.148
67         13.9      0.154
68         14.0      0.159
69         14.1      0.164
70         14.2      0.167
71         14.3      0.170
72         14.4      0.172
73         14.5      0.173
74         14.6      0.173
75         14.7      0.172
76         14.8      0.172
77         14.9      0.171
78         15.0      0.170
79         15.1      0.169
80         15.2      0.168
81         15.3      0.168
82         15.4      0.168
83         15.5      0.168
84         15.6      0.169
85         15.7      0.169
86         15.8      0.170
87         15.9      0.171
88         16.0      0.171
89         16.1      0.171
90         16.2      0.171
91         16.3      0.170
92         16.4      0.168
93         16.5      0.165
94         16.6      0.163
95         16.7      0.159
96         16.8      0.155
97         16.9      0.150
98         17.0      0.145
99         17.1      0.141
100        17.2      0.136
101        17.3      0.131
102        17.4      0.126
103        17.5      0.121
104        17.6      0.116
105        17.7      0.111
106        17.8      0.106
107        17.9      0.101
108        18.0      0.096
109        18.1      0.091
110        18.2      0.086
111        18.3      0.082
112        18.4      0.076
113        18.5      0.071
114        18.6      0.067
115        18.7      0.062
116        18.8      0.057
117        18.9      0.053
118        19.0      0.049
119        19.1      0.044
120        19.2      0.041
121        19.3      0.037
122        19.4      0.034
123        19.5      0.032
124        19.6      0.029
125        19.7      0.026
126        19.8      0.024
127        19.9      0.022
128        20.0      0.020
129        20.1      0.019
130        20.2      0.017
131        20.3      0.015
132        20.4      0.014
133        20.5      0.012
134        20.6      0.011
135        20.7      0.010
136        20.8      0.009
137        20.9      0.008
138        21.0      0.006
139        21.1      0.005
140        21.2      0.005
141        21.3      0.004
142        21.4      0.003
143        21.5      0.002
144        21.6      0.002
145        21.7      0.001
146        21.8      0.001
147        21.9      0.001
148        22.0      0.001
149        22.1      0.000
150        22.2      0.000
151        22.3      0.000
152        22.4      0.000
153        22.5      0.000
> nrow(d3) == length(intersect(alltemps$temperature, d2$temperature))
[1] TRUE
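An alternative sketch, assuming you are free to change the key: merge on an integer count of tenths of a degree instead of on a decimal, so that matching is exact by construction. The `key` column and the fixed seed are my additions for illustration, and the last two lines do the NA-to-zero padding the question describes as its next step:

```r
set.seed(1)  # fixed seed only so the sketch is reproducible
d <- density(rnorm(100, mean = 15, sd = 2))

# Key each density point by whole tenths of a degree (an integer)
d1 <- data.frame(key = as.integer(round(d$x * 10)), y = d$y)
d2 <- aggregate(list(proportion = d1$y), by = list(key = d1$key), FUN = mean)

# Full 0-30 degree grid expressed in the same integer key
alltemps <- data.frame(key = 0:300)

# Left join: keep every grid row, unmatched rows get NA
d3 <- merge(alltemps, d2, by = "key", all.x = TRUE)

# Pad the tails with zeros and recover the temperature for display
d3$proportion[is.na(d3$proportion)] <- 0
d3$temperature <- d3$key / 10
```

With integer keys there is nothing to round consistently, at the cost of carrying an extra column until the final conversion back to degrees.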

My impression is that the NA's are diagnostic of a problem in the OP's eyes, not so much the problem itself. — rbatt, Jun 10 '16 at 20:27
@rbatt They were. The problem was in the float length (not the displayed precision but the true length used for merge). I fixed it by rounding both of the temperature values in the same way before the merge. This increased the number of matches from 88 to 153. Here I'm only showing the matches instead of the legitimate `NA` values as well so that you can see the increased number of matches. — Hack-R, Jun 10 '16 at 20:51
My bad, thanks for clarifying. I didn't look closely enough. — rbatt, Jun 10 '16 at 23:03

score 0 · Answer 2 · answered Jun 10 '16 at 20:34

The parameter all.x tells merge to keep all values from alltemps, not from d2. Try the following:

d3 <- merge(alltemps, d2, all = T)

If you specify all = TRUE then merge will keep all values from both data frames. As a side note, you can define your column names at the same time that you create the data frame rather than assigning them separately:

df <- data.frame(x = rnorm(100, mean=15, sd=2))
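
To make the difference between the join options concrete, here is a toy sketch with two made-up data frames (`left`, `right` and their values are hypothetical, not from the question):

```r
# Two toy data frames sharing the key column "id"
left  <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
right <- data.frame(id = c(2, 3, 4), b = c("x", "y", "z"))

inner <- merge(left, right)                # only ids present in both: 2, 3
keepx <- merge(left, right, all.x = TRUE)  # every row of left: 1, 2, 3
outer <- merge(left, right, all = TRUE)    # rows from both: 1, 2, 3, 4

inner
keepx
outer
```

Note that none of these options changes how keys are compared, so values that differ only in their last floating-point bits would still fail to match.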

cangers


That doesn't address the 70 values in d2 which should've been matched in alltemps but which were skipped. I fixed the float data in my answer and it increased the number of matches from 88 to 153. – Hack-R Jun 10 '16 at 20:55
