
I am new to R and am looking for code to manipulate the hundreds of files that I have at hand. They are .txt files with a few rows of unwanted text, followed by columns of data, looking something like this:

XXXXX 
XXXXX
XXXXX
Col1 Col2 Col3 Col4 Col5
1 36 37 35 36 
2 34 34 36 37 
. 
. 
1500 34 35 36 35 

I wrote some code (below) to extract selected rows of columns 1 and 5 from an individual .txt file, and would like to loop over all the files that I have.

data <- read.table("/Users/tan/Desktop/test/01.txt", skip = 264, nrows = 932)
selcol <- c("V1", "V5")
write.table(data[selcol], file = "/Users/tan/Desktop/test/01ed.txt", sep = "\t")
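A loop over every file in the folder might look like this (a sketch, assuming all files share the same layout of 264 skipped lines and 932 data rows):

```r
# Process every .txt file in the folder the same way as 01.txt above.
files <- list.files("/Users/tan/Desktop/test", pattern = "\\.txt$", full.names = TRUE)
for (f in files) {
  data <- read.table(f, skip = 264, nrows = 932)
  # Write e.g. 01.txt out as 01ed.txt, keeping only columns 1 and 5
  write.table(data[c("V1", "V5")], file = sub("\\.txt$", "ed.txt", f), sep = "\t")
}
```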

With the above code, the .txt file now looks like this:

Col1 Col5  
300 34  
. 
. 
700 34  

If possible, I would like to combine all the Col5 columns of the .txt files along with one copy of Column 1 (which is the same for all the files), so that it looks something like this:

Col1 Col5a Col5b Col5c Col5d ...
300 34 34 36 37
. 
. 
700 34 34 36 37 

Thank you! Tan

    welcome to the site! The stack overflow site uses the {} button to format code, or indent four spaces. I went through and tried to format the code the way I think it should be based on the tags you had in there. You can change it back if I've messed something up. – Chase Jun 21 '11 at 04:17
  • great job! it looks much clearer now... thank you so much, Chase! :) – Tan Jun 21 '11 at 04:21

1 Answer


Alright - I think I hit on all your questions here, but let me know if I missed something. The general process that we will go through here is:

  1. Identify all of the files that we want to read in and process in our working directory
  2. Use lapply to iterate over each of those file names to create a single list object that contains all of the data
  3. Select your columns of interest
  4. Merge them together by the common column

For the purposes of the example, consider I have four files named file1.txt through file4.txt that all look like this:

    x           y          y2
1   1  2.44281173 -2.32777987
2   2 -0.32999022 -0.60991623
3   3  0.74954561  0.03761497
4   4 -0.44374491 -1.65062852
5   5  0.79140012  0.40717932
6   6 -0.38517329 -0.64859906
7   7  0.92959219 -1.27056731
8   8  0.47004041  2.52418636
9   9 -0.73437337  0.47071120
10 10  0.48385902  1.37193941

##1. identify files to read in
filesToProcess <- dir(pattern = "file.*\\.txt$")
> filesToProcess
[1] "file1.txt" "file2.txt" "file3.txt" "file4.txt"


##2. Iterate over each of those file names with lapply
listOfFiles <- lapply(filesToProcess, function(x) read.table(x, header = TRUE))

##3. Select columns x and y2 from each of the objects in our list
listOfFiles <- lapply(listOfFiles, function(z) z[c("x", "y2")])

##NOTE: you can combine steps 2 and 3 by passing in the colClasses parameter to read.table.
#That code would be:
listOfFiles <- lapply(filesToProcess, function(x) read.table(x, header = TRUE
  , colClasses = c("integer","NULL","numeric")))

##4. Merge all of the objects in the list together with Reduce. 
# x is the common columns to join on
out <- Reduce(function(x,y) {merge(x,y, by = "x")}, listOfFiles)
#clean up the column names
colnames(out) <- c("x", sub("\\.txt", "", filesToProcess))

Results in the following:

> out
    x       file1        file2       file3        file4
1   1 -2.32777987 -0.671934857 -2.32777987 -0.671934857
2   2 -0.60991623 -0.822505224 -0.60991623 -0.822505224
3   3  0.03761497  0.049694686  0.03761497  0.049694686
4   4 -1.65062852 -1.173863215 -1.65062852 -1.173863215
5   5  0.40717932  1.189763270  0.40717932  1.189763270
6   6 -0.64859906  0.610462808 -0.64859906  0.610462808
7   7 -1.27056731  0.928107752 -1.27056731  0.928107752
8   8  2.52418636 -0.856625895  2.52418636 -0.856625895
9   9  0.47071120 -1.290480033  0.47071120 -1.290480033
10 10  1.37193941 -0.235659079  1.37193941 -0.235659079
  • Thanks Chase, for your answer. It doesn't seem to work though. Would it be because my initial files are not entirely tables as there are rows of text before the table starts? Would it solve the problem if I were to iterate over the files with my initial code for a single file first, and then follow with your steps? – Tan Jun 21 '11 at 06:18
  • @Tan where @Chase has the `read.table(....)` calls, just add in your arguments `skip = 264, nrows = 932`. For example, where he has `function(x) read.table(x, header = TRUE)` change it to `function(x) read.table(x, header = TRUE, skip = 264, nrows = 932)` but do make sure all the files have 932 rows. If they don't, and the rest of the file after the non-data lines is all data, just leave off the `nrows = 932` bit. – Gavin Simpson Jun 21 '11 at 07:56
  • 2
    @Chase wouldn't it be simpler to do away with the `merge()` step and just read the column we want, and then do an extra read for the duplicate column? We can use `sapply()` to return the matrix of columns wanted rather than leave them as a list with `lapply()`. Something like: `cbind(read.table(filesToProcess[1], header = TRUE)[,1], sapply(filesToProcess, function(x) read.table(x, header = TRUE)[,3]))` (Sorry that will wrap horribly.) And then clean the column names up? – Gavin Simpson Jun 21 '11 at 08:04
  • @Gavin - that's a very good point. I was really trying to find an excuse to use `do.call()` and merge, but that didn't work as planned so I ended up with `Reduce()`. `cbind()` and `sapply()` are certainly less esoteric than `Reduce()` too. Thanks for the tip. – Chase Jun 21 '11 at 11:32
  • rather than reading all columns and discarding most of them later, I'd select the ones to read with `colClasses` in `read.table`. One thing that may go wrong with your procedure is if you have a varying number of lines to skip / read, in which case `readLines` + `grep` may be more appropriate. One final tip: with the reshape package you can readily `melt` the output of `lapply` and obtain a long-format data.frame with all your observations tagged with variable `L1` referring to the original file. – baptiste Jun 21 '11 at 23:11
  • @Gavin and @Chase - thanks! It is going well until `listOfFiles <- lapply(listOfFiles, function(z) z[c("x", "y2")])`, where an error message returns: `Error in [.data.frame(z, c("x", "y2")) : undefined columns selected` What can I do to resolve this? Thanks!!! – Tan Jun 22 '11 at 01:02
  • I figured the one above out... when I remove `header=TRUE`, it works fine, probably because of the preceding lines of text before the table starts for each file. I'm now stumped at `out <- Reduce(function(x,y) {merge(x,y, by = "x")}, listOfFiles)` - Error message: `Error in fix.by(by.x, x) : 'by' must specify valid column(s)` and `colnames(out) <- c("x", sub("\\.txt", "", filesToProcess))`- Error message: `Error in `colnames<-`(`*tmp*`, value = c("x", "00.jaz", "01.jaz", "02.jaz", : attempt to set colnames on object with less than two dimensions` – Tan Jun 22 '11 at 01:48
  • @Tan - you need to update the column names to reflect the columns in your actual data. If you look at the example I posted, the "common" column was named "x", and the second column we extracted was named "y2". You'll want to update "x" and "y2" with whatever the columns are named in your data files. – Chase Jun 22 '11 at 02:16
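Gavin's `cbind()`/`sapply()` alternative from the comments, written out for reference (a sketch; it assumes the first column is common to all files and that the wanted column is the third):

```r
# Read the shared first column once, then pull only column 3 from each file.
# sapply() returns a matrix directly, so no merge()/Reduce() step is needed.
filesToProcess <- dir(pattern = "file.*\\.txt$")
out <- cbind(read.table(filesToProcess[1], header = TRUE)[, 1],
             sapply(filesToProcess, function(x)
               read.table(x, header = TRUE)[, 3]))
colnames(out) <- c("x", sub("\\.txt$", "", filesToProcess))
```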