2

I am collating multiple excel files into one using data frames. There are duplicate columns in the files. Is it possible to merge only the unique columns?

Here is my code:

library(rJava)
library (XLConnect)

data.files = list.files(pattern = "*.xls")

# Read the first file
df = readWorksheetFromFile(file=data.files[1], sheet=1, check.names=F) 

# Loop through the remaining files and merge them to the existing data frame
for (file in data.files[-1]) {
newFile = readWorksheetFromFile(file=file, sheet=1, check.names=F)
    df = merge(df, newFile, all = TRUE, check.names=F)
} 
Unihedron
  • 10,902
  • 13
  • 62
  • 72
wanderer
  • 21
  • 2
  • 2
    Welcome to SO. You are much more likely to receive a helpful answer if you provide a [**minimal, reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) – Henrik Mar 21 '14 at 20:32
  • See `t()`- transpose and `duplicated()`. – zx8754 Sep 02 '14 at 12:00
  • @zx8754 and what about conversion from data frame to matrix and all the undesired effects of that on eg factors (become numeric, can give wrong results) and so forth? – Joris Meys Sep 02 '14 at 12:07
  • @JorisMeys it was just a comment/starting point as OP didn't provide reproducible data. – zx8754 Sep 02 '14 at 12:08
  • @zx8754 I know, but it's a bad starting point, as neither `t()` nor `duplicated()` can be of any real help in this case. – Joris Meys Sep 02 '14 at 12:09

1 Answers1

1

First of all, if you apply merge correctly, there shouldn't be any duplicated columns, provided that the duplicated columns also have the exact same name in the EXCEL files. As you use merge, there must be at least one column in the EXCEL files that have the exact same name, and contains the values used to merge them.

So I reckon you want to check in the resulting data frame whether there are duplicate columns based on the values in each column. For this, you could use the following:

keepUnique <- function(x){
  combs <- combn(names(x),2)

  dups <- mapply(identical,
                 x[combs[1,]],
                 x[combs[2,]])

  drop <- combs[2,][dups]
  x[ !names(x) %in% drop ]
}

Which gives :

> mydf <- cbind(iris,iris[,3])[1:5,]
> mydf
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species iris[, 3]
1          5.1         3.5          1.4         0.2  setosa       1.4
2          4.9         3.0          1.4         0.2  setosa       1.4
3          4.7         3.2          1.3         0.2  setosa       1.3
4          4.6         3.1          1.5         0.2  setosa       1.5
5          5.0         3.6          1.4         0.2  setosa       1.4
> keepUnique(mydf)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa

You can use this after reading in a file, i.e. add the line

newFile <- keepUnique(newFile,df)

in your own code.

Joris Meys
  • 106,551
  • 31
  • 221
  • 263