5

I want to append one data frame to another (the master one). The problem is that only subset of their columns are common. Also, the order of their columns might be different.

Master dataframe:

   a b  c
r1 1 2 -2
r2 2 4 -4
r3 3 6 -6
r4 4 8 -8

New dataframe:

      d  a   c
r1 -120 10 -20
r2 -140 20 -40

Expected result:

    a   b    c
r1  1   2   -2
r2  2   4   -4
r3  3   6   -6
r4  4   8   -8
r5 10 NaN  -20
r6 20 NaN  -40

Is there any smart way of doing this? This is a similar question but the setup is different.

Community
  • 1
  • 1
Szilard
  • 315
  • 1
  • 4
  • 10

4 Answers4

6

Check out the bind_rows function in the dplyr package. It will do some nice things for you by default, such as filling in columns that exist in one data.frame but not the other with NAs instead of just failing. Here is an example:

# Use the dplyr package for binding rows and for selecting columns
library(dplyr)

# Generate some example data
a <- data.frame(a = rnorm(10), b = rnorm(10))
b <- data.frame(a = rnorm(5), c = rnorm(5))

# Stack data frames
bind_rows(a, b)

Source: local data frame [15 x 3]

            a          b          c
1   2.2891895  0.1940835         NA
2   0.7620825 -0.2441634         NA
3   1.8289665  1.5280338         NA
4  -0.9851729 -0.7187585         NA
5   1.5829853  1.6609695         NA
6   0.9231296  1.8052112         NA
7  -0.5801230 -0.6928449         NA
8   0.2033514 -0.6673596         NA
9  -0.8576628  0.5163021         NA
10  0.6296633 -1.2445280         NA
11  2.1693068         NA -0.2556584
12 -0.1048966         NA -0.3132198
13  0.2673514         NA -1.1181995
14  1.0937759         NA -2.5750115
15 -0.8147180         NA -1.5525338

To solve the problem in your question, you would want to select for the columns in your master data.frame first. If a is the master data.frame, and b contains data that you want to add, you can use the select function from dplyr to get the columns that you need first.

# Select all columns in b with the same names as in master data, a
# Use select_() instead of select() to do standard evaluation.
b <- select_(b, names(a))

# Combine
bind_rows(a, b)

Source: local data frame [15 x 2]

            a          b
1   2.2891895  0.1940835
2   0.7620825 -0.2441634
3   1.8289665  1.5280338
4  -0.9851729 -0.7187585
5   1.5829853  1.6609695
6   0.9231296  1.8052112
7  -0.5801230 -0.6928449
8   0.2033514 -0.6673596
9  -0.8576628  0.5163021
10  0.6296633 -1.2445280
11  2.1693068         NA
12 -0.1048966         NA
13  0.2673514         NA
14  1.0937759         NA
15 -0.8147180         NA
ialm
  • 8,510
  • 4
  • 36
  • 48
  • I use R version 3.4.4 and found that `select_()` is deprecated. Normal `select()` does not produce wanted result. – vtenhunen Jan 02 '20 at 22:22
  • 1
    @vtenhunen The `dplyr` api has changed since my answer was posted. You can check out the details [here](https://dplyr.tidyverse.org/articles/programming.html). In short, instead of `select_()`, try `select(..., !! var)` where `var` is a variable containing column names. – ialm Mar 18 '20 at 16:14
2

try this:

library(plyr) # thanks to comment @ialm
df <- data.frame(a=1:4,b=seq(2,8,2),c=seq(-2,-8,-2))
new <- data.frame(d=c(-120,-140),a=c(10,20),c=c(-20,40))

# we use %in% to pull the columns that are the same in the master
# then we use rbind.fill to put in this dataframe below the master
# filling any missing data with NA values
res <- rbind.fill(df,new[,colnames(new) %in% colnames(df)])

> res
   a  b   c
1  1  2  -2
2  2  4  -4
3  3  6  -6
4  4  8  -8
5 10 NA -20
6 20 NA  40
road_to_quantdom
  • 1,341
  • 1
  • 13
  • 20
2

The dplyr- and plyr-based solutions posted here are very natural for this task using bind_rows and rbind.fill, respectively, though it is also possible as a one-liner in base R. Basically I would loop through the names of the first data frame, grabbing the corresponding column of the second data frame if it's there or otherwise returning all NaN values.

rbind(A, sapply(names(A), function(x) if (x %in% names(B)) B[,x] else rep(NaN, nrow(B))))
#     a   b   c
# r1  1   2  -2
# r2  2   4  -4
# r3  3   6  -6
# r4  4   8  -8
# 5  10 NaN -20
# 6  20 NaN -40
josliber
  • 43,891
  • 12
  • 98
  • 133
1

another option is using rbind.fill from the plyr package

bring in your sample data

toread <- "
a b  c
1 2 -2
2 4 -4
3 6 -6
4 8 -8"
master <- read.table(textConnection(toread), header = TRUE) 
toread <- "
d  a   c
-120 10 -20
-140 20 -40"
to.append <- read.table(textConnection(toread), header = TRUE) 

bind data

library(plyr)
rbind.fill(master, to.append)
Community
  • 1
  • 1
Wyldsoul
  • 1,468
  • 10
  • 16
  • 1
    If you're using `dplyr`, why not just use `bind_rows()`? – ialm Dec 14 '15 at 22:04
  • 1
    @ialm With more careful reading, this answer *doesn't* use any `dplyr` functions (it just loads the package). Worth noting that loading `plyr` *after* loading `dplyr` will mask `dplyr::summarize` and `dplyr::mutate` with the `plyr` versions and is not recommended. – Gregor Thomas Dec 14 '15 at 22:31
  • 1
    @Gregor Yes, I see that now. And for the reasons you highlight in your comment, there is a warning issued if you load `plyr` after `dplyr`, and I believe Hadley recommends loading `plyr` before `dplyr` if you ever need to use both packages. – ialm Dec 14 '15 at 22:36
  • I did not intend to include dplyr in the answer and I removed it. Thanks for pointing out that it was not needed – Wyldsoul Dec 15 '15 at 14:00