2

I know that this question has been asked previously, but answers to the previous posts cannot seem to solve my problem.

I have dozens of tab-delimited .txt files. Each file has two columns ("pos", "score"). I would like to compile all of the "score" columns into one file with multiple columns. The number of rows in each file varies and they are irrelevant for the compilation.

If someone could direct me on how to accomplish this, preferably in R, it would be a lot of helpful.

Alternatively, my ultimate goal is to read the median and mean of the "score" column from each file. So if this could be accomplished, with or without compiling the files, it would be even more helpful.

Thanks.

UPDATE:

As appealing as the idea of personal code ninjas is, I understand this will have to remain a fantasy. Sorry for not being explicit.

I have tried lapply and Reduce, e.g.,

> files <- dir(pattern="X.*\\.txt$")
> File_list <- lapply(filesToProcess,function(score)
+  read.table(score,header=TRUE,row.names=1))
> File_list <- lapply(files,function(z) z[c("pos","score")])
> out_file <- Reduce(function(x,y) {merge(x,y,by=c("pos"))},File_list)

which I know doesn't really make sense, considering I have variable row numbers. I have also tried plyr

> files <- list.files()
> out_list <- llply(files,read.table)

As well as cbind and rbind. Usually I get an error message, because the row numbers don't match up or I just get all the "score" data compiled into one column.

The advice on similar posts (e.g., Merging multiple csv files in R, Simultaneously merge multiple data.frames in a list, and Merge multiple files in a list with different number of rows) has not been helpful.

I hope this clears things up.

Community
  • 1
  • 1
stephanhart
  • 31
  • 1
  • 3
  • 2
    Stephan - could you please show us what have you tried before, and what didn't work? Your problem can be solved in R quite easily, but as stated, it's very vague. SO users are not your personal code ninjas that write code for you. – Victor K. May 02 '13 at 18:23
  • ?merge. how does setting all = FALSE (the default) not do what you want? – Jake Burkhead May 02 '13 at 19:03
  • @Jake To merge the files, I would have to list them one-by-one, which is what I'm trying to avoid. – stephanhart May 02 '13 at 19:13
  • You can use a loop with `merge()` inside. – Rcoster May 02 '13 at 19:26
  • 1
    I do not see why you are using `merge`. That is a database JOIN. From your description it sounded like you would be using either `cbind.data.frame` (but only if they had the ame number of rows) or perhaps `rbind.data.frame` preceded by addition of a file of origin column. Most analyses in R will proceed more smoothly if you leave data in the "long" form rather than laying columns side by side. – IRTFM May 02 '13 at 19:29

2 Answers2

1

This problem could be solved in two steps:

Step 1. Read the data from your csv files into a list of data frames, where files is a vector of file names. If you need to add extra arguments to read.csv, add them like shown below. See ?lapply for details.

list_of_dataframes <- lapply(files, read.csv, stringsAsFactors = FALSE)

Step 2. Calculate means for each data frame:

means <- sapply(list_of_dataframes, function(df) mean(df$score))

Of course, you can always do it in one step like this:

means <- sapply(files, function(filename) mean(read.csv(filename)$score))
Victor K.
  • 4,054
  • 3
  • 25
  • 38
0

I think you want smth like this:

all_data = do.call(rbind, lapply(files,
                                 function(f) {
                                   cbind(read.csv(f), file_name=f)
                                 }))

You can then do whatever "by" type of action you like. Also, don't forget to adjust the various read.csv options to suit your needs.

E.g. once you have the above, you can do the following (and much more):

library(data.table)
dt = data.table(all_data)

dt[, list(mean(score), median(score)), by = file_name]

A small note: you could also use data.table's fread, to read the files in instead of the read.table and its derivatives, and that would be much faster, and while we're at it, use rbindlist instead of do.call(rbind,.

eddi
  • 49,088
  • 6
  • 104
  • 155
  • This gets me part of the way there, except it outputs three columns: "pos", "score" and "file_name", so that all the data are listed continuously in a column, rather than separated into columns per file. Any suggestions? – stephanhart May 02 '13 at 20:06