2

I need to combine to data frames that have different lengths, and keep all the "missing values". The problem is that there are not really missing values, but rather just less of one value than another.

Example: df1 looks like this:

Shrub value period
1     0.5   1  
2     0.6   1
3     0.7   1
4     0.8   1 
5     0.9   1
10    0.9   1
1     0.4   2
5     0.4   2
6     0.5   2
7     0.3   2
2     0.4   3
3     0.1   3
8     0.5   3
9     0.2   3

df2 looks like this:

Shrub   x    y
1       5    8
2       6    7
3       3    2
4       1    2
5       4    6 
6       5    9
7       9    4 
8       2    1   
9       4    3
10      3    6

and i want the combined dataframe to look like:

Shrub   x    y    value   period
1       5    8    0.5        1
2       6    7    0.6        1
3       3    2    0.7        1
4       1    2    0.8        1
5       4    6    0.9        1 
6       5    9    NA         1
7       9    4    NA         1 
8       2    1    NA         1   
9       4    3    NA         1
10      3    6    0.9        1
1       5    8    0.4        2
2       6    7    NA         2
3       3    2    NA         2
4       1    2    NA         2
5       4    6    0.4        2
6       5    9    0.5        2
7       9    4    0.3        2
8       2    1    NA         2
9       4    3    NA         2
10      3    6    NA         2
1       5    8    NA         3
2       6    7    0.4        3
3       3    2    0.1        3
4       1    2    NA         3
5       4    6    NA         3
6       5    9    NA         3
7       9    4    NA         3 
8       2    1    0.5        3   
9       4    3    0.2        3
10      3    6    NA         3

I have tried the merge command using all = TRUE, but this does not give me what i want. I haven't been able to find this anywhere so any help is appreciated!

Frank
  • 66,179
  • 8
  • 96
  • 180
Logan
  • 41
  • 1
  • 3
  • Please supply [a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including the complete code you've already tried and the data for both data.frames using `dput`. – Thomas Aug 28 '15 at 20:43
  • 1
    The merge will not invent the missing value that are not in your data frame. – HubertL Aug 28 '15 at 20:49

3 Answers3

7

This is a situation where complete from package tidyr is useful (this is in tidyr_0.3.0, which is currently available on on github). You can use this function to expand df1 to include all period/Shrub combinations, filling the other variables in with NA by default. Once you do that you can simply join the two datasets together - I'll use inner_join from dplyr.

library(dplyr) library(tidyr)

First, using complete on df1, showing the first 10 lines of output:

complete(df1, period, Shrub)

Source: local data frame [30 x 3]

   period Shrub value
1       1     1   0.5
2       1     2   0.6
3       1     3   0.7
4       1     4   0.8
5       1     5   0.9
6       1     6    NA
7       1     7    NA
8       1     8    NA
9       1     9    NA
10      1    10   0.9
..    ...   ...   ...

Then all you need to do is join this expanded dataset with df2:

complete(df1, period, Shrub) %>%
    inner_join(., df2)
Source: local data frame [30 x 5]

   period Shrub value x y
1       1     1   0.5 5 8
2       1     2   0.6 6 7
3       1     3   0.7 3 2
4       1     4   0.8 1 2
5       1     5   0.9 4 6
6       1     6    NA 5 9
7       1     7    NA 9 4
8       1     8    NA 2 1
9       1     9    NA 4 3
10      1    10   0.9 3 6
..    ...   ...   ... . .
aosmith
  • 34,856
  • 9
  • 84
  • 118
  • For some reason it tells me I get an error message in which R says that it could not find the complete function. I loaded the tidyr and dplyr packages. This seems really simple though if it works! – Logan Aug 28 '15 at 21:34
  • @Logan The `complete` function is a new and tidyr_0.3.0 isn't on CRAN yet. You can get the development version of *tidyr* [here](https://github.com/hadley/tidyr). It's easy to install this once you have package *devtools* via `devtools::install_github("hadley/tidyr")`. – aosmith Aug 28 '15 at 21:39
5

Start by repeating the rows of df2 to create a "full" dataset (i.e., 30 rows, one for each shrub-period observation), then merge:

tmp <- df2[rep(seq_len(nrow(df2)), times=3),]
tmp$period <- rep(1:3, each = nrow(df2))
out <- merge(tmp, df1, all = TRUE)
rm(tmp) # remove `tmp` data.frame

The result:

> head(out)
  Shrub period x y value
1     1      1 5 8   0.5
2     1      2 5 8   0.4
3     1      3 5 8    NA
4     2      1 6 7   0.6
5     2      2 6 7    NA
6     2      3 6 7   0.4

> str(out)
'data.frame':   30 obs. of  5 variables:
 $ Shrub : int  1 1 1 2 2 2 3 3 3 4 ...
 $ period: int  1 2 3 1 2 3 1 2 3 1 ...
 $ x     : int  5 5 5 6 6 6 3 3 3 1 ...
 $ y     : int  8 8 8 7 7 7 2 2 2 2 ...
 $ value : num  0.5 0.4 NA 0.6 NA 0.4 0.7 NA 0.1 0.8 ...
Thomas
  • 43,637
  • 12
  • 109
  • 140
1

You can use dplyr. This works by taking each period in a seperate frame, and merging with all=TRUE to force all values, then putting it all back together. The cbind(df2,.. part adds on the period to the missing values so we don't get extra NA.:

library(dplyr)
df1 %>% group_by(period) %>%
        do(merge(., cbind(df2, period = .[["period"]][1]), by = c("Shrub", "period"), all = TRUE))

   Shrub period value x y
1      1      1   0.5 5 8
2      2      1   0.6 6 7
3      3      1   0.7 3 2
4      4      1   0.8 1 2
5      5      1   0.9 4 6
6      6      1    NA 5 9
7      7      1    NA 9 4
8      8      1    NA 2 1
9      9      1    NA 4 3
10    10      1   0.9 3 6
11     1      2   0.4 5 8
12     2      2    NA 6 7
13     3      2    NA 3 2
14     4      2    NA 1 2
15     5      2   0.4 4 6
16     6      2   0.5 5 9
17     7      2   0.3 9 4
18     8      2    NA 2 1
19     9      2    NA 4 3
20    10      2    NA 3 6
21     1      3    NA 5 8
22     2      3   0.4 6 7
23     3      3   0.1 3 2
24     4      3    NA 1 2
25     5      3    NA 4 6
26     6      3    NA 5 9
27     7      3    NA 9 4
28     8      3   0.5 2 1
29     9      3   0.2 4 3
30    10      3    NA 3 6
jeremycg
  • 24,657
  • 5
  • 63
  • 74