1

I’m looking to do the following in R.

I have 250+ csv files of chromatographic data structured similarly to the example below, but with 21 rows instead of three:

1  4.708252    BB    9.946890   7.830349  0.01982016   4.684836   4.742056
2  4.970352    BB    1.792341  1.497008   0.01896829   4.945352   5.005390
3  6.393414    BB    6.599891  5.309925   0.01950091   6.368413   6.428723

What I want to do is read a subset of the data in all 250 files into a single data frame, which is easy enough — but I also need to restructure it a fair bit.

Every row in the table above is a peak. I only want the data from the first and fourth columns (which are ‘peak number’ and ‘area under the peak’, respectively), and in the output I need to make each peak an individual column, rather than a row as above, with the peak number as the header. Finally, I want to create a new column where each row (that is, the data from each individual csv file) is given the same name as the csv file name.

So, imagine I have 3 files: ABC1.csv, ABC2.csv, and ABC3.csv. Each file looks like my example above. I want to automatically take all those files and merge them into a single data frame such as the one below.

ID          1         2          3     
ABC1     9.94689   1.792341   6.599891   
ABC2     9.76651   1.932332   6.600022
ABC3     8.99193   2.556471   6.718934

I hope I’ve made this clear enough. I’ve been able to manage most of the steps but haven’t been successful writing them into a single script. And I have no idea how, if there is any way, to make the file name into a variable.

Cheers

Ricky
  • 4,616
  • 6
  • 42
  • 72
SMZ
  • 21
  • 4
  • In your description, you said the columns that are important are 1 and 3. Where is the first column iin the output? – akrun Dec 18 '15 at 02:54
  • Sorry, I didn't explain very well. The first column represents the peak number, which in the output are the column headers. – SMZ Dec 18 '15 at 03:14
  • just to clarify, for your example (which becomes row 1 in output) you're referring to column 1 being the number `1`, and column 3 being the number `9.946890` ? – Ricky Dec 18 '15 at 03:21
  • Yes, that's right. And I've just realised what I mean is column 4, not 3. For some reason I was ignoring the fact that peak number is a column... sorry about that – SMZ Dec 18 '15 at 03:42
  • Please also include the column names in the input dataset. – akrun Dec 18 '15 at 04:39

3 Answers3

2

I am assuming the working directory is set to where the files are. Then you can get the list of files below.

filenames <- list.files()

Have a helper function to read a file and keep just columns 1 and 4.

readdata <- function(filename) {
  df <- read.csv(filename)
  vec <- df[, 4]
  names(vec) <- df[, 1]
  return(vec)
}

Loop over all of the files and rbind them

result <- do.call(rbind, lapply(filenames, readdata))

Name them as you like

row.names(result) <- filenames
Ricky
  • 4,616
  • 6
  • 42
  • 72
  • Neat, I'm going to change all my code to use this now :) – Mist Dec 18 '15 at 04:57
  • BTW I'm assuming there's same number of rows in every file (i.e. every column in output will be filled), since you didn't define what should happen if one file yields 3 columns of output and another yields e.g. 5 columns of output. – Ricky Dec 18 '15 at 04:57
  • 1
    `data.table::fread(file, select = c(1, 4))` would be much more efficient than reading the entire file and then removing columns. – Rich Scriven Dec 18 '15 at 05:54
  • @RichardScriven I agree. Was just putting together a `base` R solution that can run without me testing it. Am not familiar enough with `data.table` to do so. – Ricky Dec 18 '15 at 06:10
  • Actually, I spoke too soon. @Ricky this is perfect except that the new data frame doesn't have the data from the first row in each csv file (peak 1). – SMZ Dec 21 '15 at 05:38
  • I suspect your csv file does not have column header name then? Does changing `read.csv(filename)` in the function to `read.csv(filename, header=F)` solve the problem? – Ricky Dec 21 '15 at 05:52
  • @Ricky Perfect — thanks! Yep, the software producing the data files doesn't do headers for some reason. – SMZ Dec 21 '15 at 07:34
  • Cool. You can accept this as answer then to close the question. Good luck! – Ricky Dec 21 '15 at 07:38
0

this following code can probably be of some help, though the file name is still not working properly -

    path <- "C:\\Users\\Vidyut\\"
    filenames <- list.files(path = path,pattern = ".csv")

    l <- data.frame(ID=character(),col1=numeric(),col2=numeric(),col3=numeric(),stringsAsFactors=FALSE)
    for (i in filenames) {
      #i = filenames[1]
      full = paste(path,i,sep="")
      m <- read.csv(full, header=F)

      # extract the subset of rows required from each file
      # m <- m[c(),]

      n<- m[,c(1,4)]
      y <- gsub('.csv','',i)
      print("y=")
      print(y)
      d <- list(ID=as.character(y),col1=n[1,2],col2=n[2,2],col3=n[3,2])
      print("d=")
      print(d)
      l <- rbind.data.frame(l,d)
      print("l=")
      print(l)
    }

Mind you, this is not very pretty code - just something hacked together to get the job done (visible from the multiple print lines scattered across).

vsdaking
  • 476
  • 9
  • 22
0

Here's a solution for you. This only works if we can assume that there are exactly 21 peaks in each file and they are in order 1:21. If that's not the case a few changes to the code should remedy this.

folder = "c:/temp/"
files <- dir(folder)

first_loop <- TRUE
for (file in files) {
  # Read one file, only the first and fourth columns
  temp <- read.csv(file=paste0(folder,file), 
                   header = FALSE, 
                   colClasses = c("integer", "NULL", "NULL", "numeric",     "NULL", "NULL", "NULL", "NULL"))
  # Transpose the data
  temp <- data.frame(t(temp))
  # Remove the peak number
  temp <- temp[2,]

  # Concatenate the dataframes together
  temp$file <- file
  if (first_loop) {
    data <- temp
    first_loop <- FALSE
  } else {
    data <- rbind(data, temp)
  }
}
data
Mist
  • 1,888
  • 1
  • 14
  • 21