
I have many data files (*.dat), all in the same format. I want to pick one column (at the same position in every file) from each file and merge them into a data frame.

Is this possible in R, and if so, how?


I have about 200 files, each with 23 rows and 7 columns. Here is one of them:

  12000.0000       77.17   59.09    0.17        1.82   59.09   32.4564
   6000.0000       77.52   32.68    0.11        1.83   32.68   17.8731
   3000.0000       77.80   18.98    0.12        1.83   18.98   10.3449
   1500.0000       77.99   11.23    0.12        1.84   11.23    6.1084
    750.0000       78.13    6.93    0.15        1.84    6.93    3.7636
    375.0000       78.21    4.53    0.28        1.84    4.53    2.4552
    187.5000       78.27    3.37    0.51        1.85    3.37    1.8253
     93.7500       78.36    2.84    0.99        1.85    2.84    1.5387
     46.8750       78.48    2.04    1.37        1.85    2.04    1.1049
     23.4375       78.53    0.98    0.17        1.85    0.98    0.5291
     11.7188       78.52   -0.23    0.15        1.85   -0.23   -0.1242
      5.8594       78.48   -0.74    0.08        1.85   -0.74   -0.3973
      2.9297       78.44   -0.83    0.03        1.85   -0.83   -0.4499
      1.4648       78.43   -1.49    0.06        1.85   -1.49   -0.8059
      0.7324       78.40   -3.20    0.15        1.85   -3.20   -1.7297
      0.3662       78.24   -5.33    0.04        1.85   -5.33   -2.8879
      0.1831       77.94   -6.84    0.07        1.84   -6.84   -3.7212
      0.0916       77.71   -5.76    0.08        1.83   -5.76   -3.1449
      0.0458       77.35   -3.57    0.11        1.82   -3.57   -1.9588
      0.0229       77.44   -0.88    0.13        1.83   -0.88   -0.4810
      0.0114       77.31    0.72    0.23        1.82    0.72    0.3928
      0.0057       77.59    1.63    0.51        1.83    1.63    0.8929
      0.0029       77.61    0.34    2.65        1.83    0.34    0.1841

I want to take column 6 and combine it with the 6th column from the other files to make a matrix (23x200).

Marek
Nam Van
  • You should add a sample of a file so that we know what the separator is, whether it has headers, and how you can be sure the number of rows in each will be the same. – mdsumner May 13 '11 at 05:38
  • and also what the class of the desired column is meant to be, since using "colClasses" is a neat way to read only some columns – mdsumner May 13 '11 at 05:56

2 Answers


Another way of doing this (based on @mdsumner's answer) would be (untested):

# get a list of files ending in ".dat"
my.file.list <- list.files(pattern = "\\.dat$")
# for each file, read only the third of four columns and return it as a vector
my.list <- lapply(X = my.file.list, FUN = function(x) {
            read.table(x, colClasses = c("NULL", "NULL", "numeric", "NULL"), sep = ",")[,1]
        })
# bind the vectors column-wise into a matrix (wrap in as.data.frame() for a data frame)
my.df <- do.call("cbind", my.list)
Roman Luštrik

If you want the third column of a comma-separated file with four columns (add header = TRUE if the file has a header row):

d <- read.table("file.dat", colClasses = c("NULL", "NULL", "numeric", "NULL"), sep = ",")

Replace "numeric" with another class as appropriate, and use as many "NULL" entries as needed so that every column in the file is described.
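For the 7-column, whitespace-separated file shown in the question, the colClasses vector can be built rather than typed out. This is only a sketch: keep_one_column is a hypothetical helper name, and the demo file reuses the first two rows of the sample data. The default sep = "" of read.table splits on any run of whitespace.

```r
# Hypothetical helper: a colClasses vector that keeps a single column
# and marks every other column as "NULL" (i.e. skipped while reading).
keep_one_column <- function(n_cols, keep, class = "numeric") {
  classes <- rep("NULL", n_cols)
  classes[keep] <- class
  classes
}

# Demo file: the first two rows of the sample data from the question.
demo_file <- tempfile(fileext = ".dat")
writeLines(c("  12000.0000       77.17   59.09    0.17        1.82   59.09   32.4564",
             "   6000.0000       77.52   32.68    0.11        1.83   32.68   17.8731"),
           demo_file)

# Only column 6 is read, so the result has one column; [, 1] turns it into a vector.
col6 <- read.table(demo_file, colClasses = keep_one_column(7, 6))[, 1]
print(col6)
```

The same call with keep = 3 and n_cols = 4 reproduces the vector used in the answer above.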

To get every file called "*.dat" in the current directory:

fs <- list.files(pattern = "\\.dat$")
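Note that the pattern argument of list.files is a regular expression, not a shell glob, so the dot has to be escaped to match a literal ".dat" extension. A small illustration on made-up file names (the names are assumptions, not files from the question):

```r
# Made-up file names to show what each pattern matches.
candidates <- c("a.dat", "b.DAT", "notes_dat", "c.dat.bak")

loose  <- grepl("dat$", candidates)              # any name ending in "dat"
strict <- grepl("\\.dat$", candidates)           # names ending in a literal ".dat"
globbed <- grepl(glob2rx("*.dat"), candidates)   # glob translated to a regex

print(data.frame(candidates, loose, strict, globbed))
```

glob2rx() from the utils package is one way to keep writing "*.dat" while still passing a valid regular expression.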

To build the matrix from that column of every file, assuming the same classes and number of columns as above:

mat <- NULL
for (i in seq_along(fs)) {
 # read the single kept column as a vector and append it as a new matrix column
 mat <- cbind(mat, read.table(fs[i], colClasses = c("NULL", "NULL", "numeric", "NULL"), sep = ",")[,1])
}

For reasonably large data files you should pre-allocate the matrix. You can find the required number of rows in one go by reading one file (assuming all files have the same number of rows, as well as the same structure):

d0 <- read.table(fs[1], colClasses = c("NULL", "NULL", "numeric", "NULL"), sep = ",")[,1]
# d0 is a plain vector, so use length() rather than nrow()
nr <- length(d0)

Now the loop above becomes more memory-efficient:

# pre-allocate a numeric matrix filled with NA
mat <- matrix(NA_real_, nrow = nr, ncol = length(fs))
for (i in seq_along(fs)) {
 mat[,i] <- read.table(fs[i], colClasses = c("NULL", "NULL", "numeric", "NULL"), sep = ",")[,1]
}
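Adapted to the layout the question eventually showed (7 whitespace-separated columns, no header, column 6 wanted), the whole thing might look roughly like the sketch below. The demo directory, the run01.dat style file names and the integer values are made up so that the example runs on its own; vapply is swapped in for the explicit loop because it builds the matrix and also checks that every file yields the same number of rows.

```r
# Build a small demo directory (hypothetical files standing in for the real data).
demo_dir <- tempfile("dat_demo")
dir.create(demo_dir)
for (k in 1:3) {
  # 4 rows, 7 columns; column 6 holds k*10 + row number
  rows <- sprintf("%d 1 2 3 4 %d 6", 1:4, k * 10L + 1:4)
  writeLines(rows, file.path(demo_dir, sprintf("run%02d.dat", k)))
}

files <- list.files(demo_dir, pattern = "\\.dat$", full.names = TRUE)

# Keep only column 6 of 7; all other columns are skipped while reading.
classes <- rep("NULL", 7)
classes[6] <- "numeric"

# Row count taken from the first file; vapply errors if another file differs.
n_rows <- nrow(read.table(files[1], colClasses = classes))

mat <- vapply(files,
              function(f) read.table(f, colClasses = classes)[, 1],
              numeric(n_rows))
colnames(mat) <- basename(files)   # one column per file

df <- as.data.frame(mat)           # data frame version, as asked in the question
print(dim(mat))
```

With the real data, replacing demo_dir by the directory that holds the 200 files should give a 23x200 matrix, provided every file really has 23 rows.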
mdsumner