Apologies if this has been answered elsewhere. I'm new to R, and have spent both of my two days with it trying to get past this initial hurdle.

I've been given a data set made up of approximately 2000 separate data files. I would like to merge them into one very large data set. I've found a couple of approaches that people say work, but none have worked for me. For example, one blog (http://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/) recommends using the following code:

setwd("target_dir/")

file_list <- list.files()

for (file in file_list){

  # if the merged dataset doesn't exist, create it
  if (!exists("dataset")){
    dataset <- read.table(file, header=TRUE, sep="\t")
  }

  # if the merged dataset does exist, append to it
  if (exists("dataset")){
    temp_dataset <-read.table(file, header=TRUE, sep="\t")
    dataset<-rbind(dataset, temp_dataset)
    rm(temp_dataset)
  }

}

When I use this code (changing 'target_dir' to the correct directory), R presents me with the following:

Error in match.names(clabs, names(xi)) : 
  names do not match previous names

My hunch is that either I have missed a variable in the code that needs to be changed to match my data (I changed 'target_dir' to the correct directory, but nothing else), or the error occurs because the .dat files don't have any column headings. If it is the latter, my second question is whether there is a way to assign the same column headings to multiple .dat files using R.

Many thanks for taking the time to read this.

  • Without seeing the data, it's really difficult to judge. It sounds like the datasets have different numbers of columns and/or different header names. (Additionally, you will be reading in your first dataset twice, since the second `if` statement will see the `dataset` created by the first. Instead use `if (!exists("dataset")) { ...read.table... } else { ...read.table...rbind... }`.) – r2evans May 05 '14 at 15:00
  • (BTW: this is a plug for providing [reproducible examples](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Read and heed, you will get more/faster/better responses.) – r2evans May 05 '14 at 15:01
  • If there is no header you need to read with header = FALSE, and you will need to create an empty data.table with suitable column names on it. – James King May 05 '14 at 15:02

1 Answer

Try this:

setwd("target_dir/")

file_list <- list.files()

for (file in file_list){

  # read the current file; the .dat files have no header row
  temp_dataset <- read.table(file, header=FALSE, sep="\t",
                             col.names = c("a", "b", "c"))

  if (!exists("dataset")){
    # if the merged dataset doesn't exist, create it
    dataset <- temp_dataset
  } else {
    # if the merged dataset does exist, append to it
    dataset <- rbind(dataset, temp_dataset)
  }
  rm(temp_dataset)
}

Replace c("a", "b", "c") with the names you want to use for the columns, or leave out the col.names argument and R will use V1, V2, etc.

However, it is better not to use a for loop, as pointed out in the comments, because growing a data frame with rbind on every iteration copies it each time. Use lapply to read in all the data frames, then do.call(rbind, ...) or plyr::rbind.fill to stack the data frames you have read.
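A minimal sketch of the lapply approach follows. It assumes tab-separated .dat files with no header row and three columns; the temporary directory, the demo file names, and the column names a, b, c are made up for illustration and would be replaced with your own.

```r
# Hypothetical demo data: three small tab-separated files without headers,
# written to a temporary directory so the sketch is self-contained.
target_dir <- file.path(tempdir(), "merge_demo")
dir.create(target_dir, showWarnings = FALSE)
for (i in 1:3) {
  demo <- data.frame(a = i, b = c("x", "y"), c = c(1.5, 2.5))
  write.table(demo, file.path(target_dir, sprintf("file%d.dat", i)),
              sep = "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
}

# List only the .dat files, with full paths so setwd() is not needed.
file_list <- list.files(target_dir, pattern = "\\.dat$", full.names = TRUE)

# Read every file into a list of data frames that share the same column names.
data_list <- lapply(file_list, read.table, header = FALSE, sep = "\t",
                    col.names = c("a", "b", "c"))

# Stack all data frames in a single call instead of growing one in a loop.
dataset <- do.call(rbind, data_list)
```

With real data, point target_dir at the directory holding your files and drop the demo-writing loop. If some files turn out to have different sets of columns, do.call(rbind, ...) will stop with an error, and plyr::rbind.fill(data_list) may be a better fit because it pads missing columns with NA.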

James King
  • +1 for fixing the problem, but this code commits the cardinal sin of growing an object in a loop. It's better to use `lapply` and `do.call(rbind, ...)`. – Roland May 05 '14 at 15:19
  • You're right, I copied the OP's code and adjusted it. `do.call(rbind, ...)` or `plyr::rbind.fill` is the way to go. Will edit the answer. – James King May 05 '14 at 15:23