
I have two directories, each containing many files, and the files in the two directories share the same names. I'd like to apply a function to each matching pair (for instance, run a correlation between dir1/file1 and dir2/file1 and extract the estimate), repeat this for every pair of files with matching names, and store the results as a data frame.

I'm trying something like this:

f1 = list.files("path1", "*abc.csv")
f2 = list.files("path2", "*abc.csv")


for (i in 1:length(f1)) {
  tmp <- as.matrix(read.csv(f1[i], header=FALSE)) 
  tmp2 <- as.matrix(read.csv(f2[i], header=FALSE))
  c = cor.test(tmp,tmp2) 
  lst[[f1[i]]] <- c$estimate
}

But I'm a little stuck because of the matching filenames, and I suspect that apply plus a match call might be a better choice. I've searched and found solutions for importing multiple files and applying a function to them, but none for importing two batches whose files have identical names.

amurphy
  • Obvious question, but can you slightly change the file names by any chance? Or is it one of those not so fun admin protected server files... – Tony Hellmuth Apr 16 '18 at 13:42
  • @amurphy: Can you not read each folder into separated data frames then run `cor.test`? See [this](https://stackoverflow.com/a/48105838/786542) for efficient ways to read multiple files at once – Tung Apr 16 '18 at 13:54
  • @TonyHellmuth, I'd rather have a solution that doesn't require filename changes due to various restrictions – amurphy Apr 16 '18 at 14:01
  • @Tung thanks - I'll look into that method – amurphy Apr 16 '18 at 14:04
  • if each folder has identically named files you only need to load one set of names no? And then `lapply` along that list of file names using the two different directory names. Or perhaps I am misunderstanding the question. – gfgm Apr 16 '18 at 14:05
  • Fair enough. Just in case, you can do it in R using `file.rename`. – Tony Hellmuth Apr 16 '18 at 14:06
  • @TonyHellmuth Thanks. I'd normally use a shell script for that kind of operation, but good to know the solution within R – amurphy Apr 16 '18 at 14:10

2 Answers


I think you could do something like this:

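# Correlation estimate for one file name that exists in both folders
get.cor <- function(name, path1 = "path1", path2 = "path2") {
  # file.path() adds the separator, so path1 and path2 need no trailing slash
  f1 <- file.path(path1, name)
  f2 <- file.path(path2, name)
  m1 <- as.matrix(read.csv(f1, header = TRUE))
  m2 <- as.matrix(read.csv(f2, header = TRUE))
  cor.test(m1, m2)$estimate
}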

# Some toy folders and data
dir.create("tmpfolder")
dir.create("tmpfolder2")
set.seed(123)
m1 <- matrix(rnorm(100), nrow=10)
m2 <- matrix(rnorm(100), nrow=10)
cor.test(m1, m2)$estimate
#>         cor 
#> -0.04953215

write.csv(m1, "tmpfolder/f1.csv", row.names = FALSE)
write.csv(m2, "tmpfolder2/f1.csv", row.names = FALSE)

# since names are identical one list of names will suffice
f.names <- list.files("tmpfolder/")

# now apply the function to each file name
lapply(f.names, function(n){get.cor(n, path1 = "tmpfolder/", path2 = "tmpfolder2/")})
#> [[1]]
#>         cor 
#> -0.04953215
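
Since the question asks for a data frame rather than a list, the same idea can be sketched with `vapply`. This is only an illustration under assumptions: the helper `cor_estimate`, the temporary folders, and the file names are all made up here, and the matrices are flattened to vectors before being passed to `cor.test`.

```r
# Hypothetical toy setup: two temporary folders holding files with the same names
dir1 <- file.path(tempdir(), "dirA")
dir2 <- file.path(tempdir(), "dirB")
dir.create(dir1, showWarnings = FALSE)
dir.create(dir2, showWarnings = FALSE)

set.seed(1)
for (nm in c("x_abc.csv", "y_abc.csv")) {
  write.csv(matrix(rnorm(20), nrow = 5), file.path(dir1, nm), row.names = FALSE)
  write.csv(matrix(rnorm(20), nrow = 5), file.path(dir2, nm), row.names = FALSE)
}

# Hypothetical helper: correlation estimate for one file name present in both folders
cor_estimate <- function(name, path1, path2) {
  v1 <- as.vector(as.matrix(read.csv(file.path(path1, name))))
  v2 <- as.vector(as.matrix(read.csv(file.path(path2, name))))
  unname(cor.test(v1, v2)$estimate)
}

# One list of names is enough because both folders use the same names
f.names <- list.files(dir1, pattern = "abc\\.csv$")

# vapply returns a plain numeric vector, which drops straight into a data frame
result <- data.frame(
  file = f.names,
  estimate = vapply(f.names, cor_estimate, numeric(1),
                    path1 = dir1, path2 = dir2, USE.NAMES = FALSE)
)
print(result)
```

Compared with `lapply`, `vapply` checks that every call returns exactly one number, so a malformed file fails loudly instead of leaving an odd element in the result.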
gfgm
  • Thank you, this is very close - I think instead of read.csv it'd have to be list.files to match a pattern, such as only importing files that match *abc.csv? I've updated my question to reflect the need for the pattern match. – amurphy Apr 16 '18 at 14:25
  • if you just edit the list.files line to read `list.files("tmpfolder/", "*abc.csv")` it should work. The general principle is: get your single list of file names and your two paths to the folders, and then you can iterate along the list of names. – gfgm Apr 16 '18 at 14:28
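
One caveat on the pattern discussed in these comments: the `pattern` argument of `list.files` is a regular expression, not a shell glob. A small sketch of translating a glob with `glob2rx` (from the `utils` package that ships with R) is below; the file names are invented purely to show what gets matched.

```r
# Hypothetical file names, used only to show how the pattern is interpreted
files <- c("run1_abc.csv", "run2_abc.csv", "run1_abc.csv.bak", "notes.txt")

# glob2rx() translates a shell-style wildcard into an anchored regular expression
rx <- glob2rx("*abc.csv")
print(rx)

# The anchored pattern keeps only names that end in "abc.csv"
matched <- grep(rx, files, value = TRUE)
print(matched)
```

The resulting `rx` can be passed directly as the `pattern` argument, e.g. `list.files("tmpfolder/", pattern = rx)`.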

I would first read all the files in as matrices, then run all the correlations with `mapply`, which keeps the code compact.

# read file paths (full.names = TRUE so read.csv can find them from any working directory)
f1 = list.files("path1", "\\.csv$", full.names = TRUE)
f2 = list.files("path2", "\\.csv$", full.names = TRUE)

# sort the files so they line up in both lists
f1 = sort(f1)
f2 = sort(f2)

# load them as matrices
f11 = lapply(f1, function(x) as.matrix(read.csv(x)))
f22 = lapply(f2, function(x) as.matrix(read.csv(x)))

# run the correlation tests pairwise
cor_tests = mapply(cor.test, f11, f22)

An example with dummy data:

f1 = list(rnorm(100), rnorm(100))
f2 = list(2*rnorm(100), 2*rnorm(100))

ab = mapply(cor.test, f1, f2)
ab[rownames(ab) == "estimate"]
[[1]]
       cor 
-0.1024785 

[[2]]
      cor 
0.1020779
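
If the two folders might not contain exactly the same set of files, sorting alone will not pair them up correctly. A hedged sketch of one way around that is to pair files by name with `intersect` and collect the estimates into a data frame. The folder names, file names, and the single-column layout (`v`) below are assumptions made up for the example.

```r
# Hypothetical toy folders: "b.csv" exists only in the first one
d1 <- file.path(tempdir(), "pair1")
d2 <- file.path(tempdir(), "pair2")
dir.create(d1, showWarnings = FALSE)
dir.create(d2, showWarnings = FALSE)

set.seed(42)
write_toy <- function(dir, name) {
  write.csv(data.frame(v = rnorm(30)), file.path(dir, name), row.names = FALSE)
}
for (nm in c("a.csv", "b.csv", "c.csv")) write_toy(d1, nm)
for (nm in c("a.csv", "c.csv")) write_toy(d2, nm)

# Keep only names present in both folders, so position i refers to the same file
common <- intersect(list.files(d1, "\\.csv$"), list.files(d2, "\\.csv$"))

x <- lapply(file.path(d1, common), function(p) read.csv(p)$v)
y <- lapply(file.path(d2, common), function(p) read.csv(p)$v)

# Pull just the estimate out of each test and store it next to its file name
est <- mapply(function(a, b) unname(cor.test(a, b)$estimate), x, y)
paired <- data.frame(file = common, estimate = est)
print(paired)
```

Extracting the estimate inside the function passed to `mapply` gives a plain numeric vector, which avoids indexing into the matrix of test components afterwards.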
Felipe Alvarenga