2

Imagine I have 100 csv files, all of them have common subject ID in first column, but other columns are totally different from each other. I would like to have one excel file that has ID in first column and other columns are all the columns in all of other csv files. How can I do this? I cannot merge every two of them and merge them to other one in R. This would be frustrating.

Consider:

file1.csv with 3 columns "subjectID","a","b"
file2.csv with 3 columns "subjectID", "c","d"
file3.csv with 2 columns "subjectID", "e"

Finally I would like to have one csv file with 6 columns:

"subjectID","a","b","c","d","e" 
Helia
  • 228
  • 1
  • 10
  • Can you be more specific about your expected output? Perhaps with some example csv input and output? Doesn't have to be 100 csv files, a few small ones should be sufficient. – Richie Thomas Aug 09 '18 at 02:08

1 Answers1

2

You could do the following

# Read in files 
#lst <- lapply(files, read.csv)

# Generate similar sample data to demonstrate
lst <- list(
    data.frame(subjectID = letters[1:10], a = runif(10), b = runif(10)),
    data.frame(subjectID = letters[1:10], c = runif(10), d = runif(10)),
    data.frame(subjectID = letters[1:10], e = runif(10), f = runif(10)),
    data.frame(subjectID = letters[1:10], g = runif(10), h = runif(10)))

# Merge data from all files on subjectID
Reduce(function(x, y) merge(x, y, by = "subjectID"), lst)
#   subjectID         a           b         c          d         e           f
#1          a 0.3303817 0.297198993 0.9521621 0.07472854 0.8422689 0.642384618
#2          b 0.4693850 0.029617471 0.1079085 0.97297463 0.8047761 0.002465216
#3          c 0.1232060 0.351755203 0.4649148 0.97412774 0.3047000 0.290868067
#4          d 0.7906051 0.402014018 0.7141169 0.69951165 0.4372228 0.142227230
#5          e 0.3958683 0.119870791 0.1061828 0.07939243 0.5506707 0.276125793
#6          f 0.8460007 0.032571856 0.4205542 0.03433463 0.4095929 0.561597813
#7          g 0.3087469 0.002836689 0.6625422 0.43830865 0.5944669 0.186904600
#8          h 0.3501046 0.599942351 0.2073871 0.11963722 0.7769929 0.367783960
#9          i 0.7952080 0.400595114 0.9792009 0.30959206 0.5644129 0.122465491
#10         j 0.3829504 0.972797955 0.9483458 0.93079712 0.2273367 0.726364011
#           g          h
#1  0.3224803 0.09905568
#2  0.9986640 0.42053490
#3  0.5484119 0.88754806
#4  0.3274199 0.87417816
#5  0.9474794 0.40207119
#6  0.3864848 0.97977549
#7  0.4875860 0.31788236
#8  0.5094075 0.86424560
#9  0.3900625 0.11860494
#10 0.7064986 0.11939311

Note that I have generated a list of 4 data.frames as sample data; all data.frames share a common column subjectID. In your case, you will be reading in data with e.g. read.csv based on filenames given in files.

Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Thanks you so much for your response. I run this code: lst <- list.files(path = "/path/to/csvfiles", pattern = NULL, all.files = FALSE, full.names = TRUE) Reduce(function(x, y) merge(x, y, by = "subjectID"), lst) I get the error: Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column. Could please help? Thanks so much – Helia Aug 09 '18 at 02:18
  • @Helia You may need to change the name of the column by which to merge the datasets. In my example, I assumed its name was `"subjectID"`. Probably this is different in your case. You should change `"subjectID"` to whatevery the name is of the column that is common to all files. – Maurits Evers Aug 09 '18 at 03:12
  • I changed it to "subjectID" too. I still have that error. – Helia Aug 09 '18 at 03:24
  • @Helia. Hmm, do you mean you changed the column names of your original datasets? `"subjectID"` needs to be the name of the unique column that is common to all CSV files. There must be one and only one column with that name per file. – Maurits Evers Aug 09 '18 at 03:34
  • @Helia The example from this post should work. So there must be something different with your actual data. It's difficult to provide help without knowing anything about your CSV files. Can you edit your post to include the top 6 lines of three of your CSV files? – Maurits Evers Aug 09 '18 at 13:46