-1

I have 3 large csv files (OCA1 = 3649 observations, 521 variables, OCA2 = 3772 observations, 2513 variables, OCA3 = 878 observations, 2513 variables). I want to combine them into 1 csv file in R. My only concern is they have different columns however, the first 10 columns for each file are the same. Here is an example:

OCA1: enter image description here

OCA2: enter image description here

As you can see, up until "Format" the column names are the same. What I would like is for the desired output to look like this:

#CHROM      POS    ID   REF  ALT  QUAL   FILTER   INFO  FORMAT  NA06984  NA006985  HG00096  HG00097
   11      891...  rs..  A   G    100      PASS    ..    GT       0|0    0|0
   11      891...  rs..  A   G    100      PASS    ..    GT                       0|0    0|0

Where the columns after "Format" from OCA2 get added to OCA1 and the rows from OCA2 get added after the last OCA1 observation (3649).

I initially tried rbind but I was struggling due to the columns.

temsandroses
  • 311
  • 1
  • 3
  • 11
  • 2
    Possible duplicate of [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – NelsonGon Apr 03 '19 at 15:36
  • `merge` and rearrange columns. – NelsonGon Apr 03 '19 at 15:37

2 Answers2

0

bind_rows from dplyr can help bind data sets with different number of columns. Here is an example:

library(dplyr)

OCA1 <- data_frame(
  x = 1:3
)

OCA2 <- data_frame(
  x = 1:5,
  y = letters[1:5]
)

OCA3 <- data_frame(
  x = 1:10,
  y = letters[1:10],
  z = LETTERS[1:10]
)

df <- bind_rows(
  OCA1,
  OCA2,
  OCA3
)
OzanStats
  • 2,756
  • 1
  • 13
  • 26
  • I used this just now combining OCA1 and OCA2 but the number of variables doesn't change. OCA2 has 2513 and the df I created has the same amount when I wanted to combine them such that it would be 2513 + 521 - 10 (for the first 10 columns that are the same) = 3024 variables. – temsandroses Apr 03 '19 at 15:20
  • @temsandroses; you may want to make sure that your variable names are not the same except for the overlapping 10 columns. The community would be more helpful if you could share a reproducible example. – OzanStats Apr 03 '19 at 15:34
  • I managed to get it to work testing on a subset of the data! As the files are very large its hard for me to determine which ones are duplicates. – temsandroses Apr 03 '19 at 15:42
0

Maybe you could create NA columns for missing ones, and then do rbind

df_l = list(df1 = data.frame('A'=rep("f1",10),'B'=runif(10),'C'=seq(1:10)),
            df2 = data.frame('A'=rep("f2",20),'B'=runif(20),'D'=paste0("X",seq(1:20))),
            df3 = data.frame('A'=rep("f3",30),'C'=seq(1:30),'D'=paste0("Y",seq(1:30))))
all_names = unique(c(colnames(df_l[['df1']]),
                     colnames(df_l[['df2']]),
                     colnames(df_l[['df3']]))) 

for (i in names(df_l)) {
  abs_col = all_names[!all_names %in% names(df_l[[i]])]
  if(length(abs_col) > 0) df_l[[i]][,abs_col] <- NA
} ; rm(i)

do.call("rbind", df_l)
ophdlv
  • 254
  • 1
  • 6