Import and apply function to files with the same name in different directories

Question

I have two directories, each with many files in them. The files in the two directories have the same names. What I'd like to do is apply a function (for instance a correlation, extracting the estimate) to dir1/file1 and dir2/file1, repeat this over all files which match in name, and store the results as a data frame.

I'm trying something like this:

f1 = list.files("path1", "*abc.csv")
f2 = list.files("path2", "*abc.csv")


for (i in 1:length(f1)) {
  tmp <- as.matrix(read.csv(f1[i], header=FALSE)) 
  tmp2 <- as.matrix(read.csv(f2[i], header=FALSE))
  c = cor.test(tmp,tmp2) 
  lst[[f1[i]]] <- c$estimate
}

But I'm a little stuck due to the matching filenames and also thinking that apply plus a match call might be a better choice. I've searched and found solutions on dealing with importing and applying a function to multiple files, but not when importing two batches and the files have identical names.

Obvious question, but can you slightly change the file names by any chance? Or is it one of those not so fun admin protected server files... — Tony Hellmuth, Apr 16 '18 at 13:42
@amurphy: Can you not read each folder into separated data frames then run `cor.test`? See [this](https://stackoverflow.com/a/48105838/786542) for efficient ways to read multiple files at once — Tung, Apr 16 '18 at 13:54
@TonyHellmuth, I'd rather have a solution that doesn't require filename changes due to various restrictions — amurphy, Apr 16 '18 at 14:01
if each folder has identically named files you only need to load one set of names no? And then `lapply` along that list of file names using the two different directory names. Or perhaps I am misunderstanding the question. — gfgm, Apr 16 '18 at 14:05
Fair enough. Just in case, you can do it in R using `file.rename`. — Tony Hellmuth, Apr 16 '18 at 14:06
@TonyHellmuth Thanks. I'd normally use a shell script for that kind of operation, but good to know the solution within R — amurphy, Apr 16 '18 at 14:10

score 2 · Answer 1 · answered Apr 16 '18 at 14:12

I think you could do something like this:

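get.cor <- function(name, path1 = "path1", path2 = "path2") {
  # file.path inserts the separator, so the directory paths need no trailing slash
  f1 <- file.path(path1, name)
  f2 <- file.path(path2, name)
  m1 <- as.matrix(read.csv(f1, header = TRUE))
  m2 <- as.matrix(read.csv(f2, header = TRUE))
  # correlation estimate between the values of the two files
  cor.test(m1, m2)$estimate
}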

# Some toy folders and data
dir.create("tmpfolder")
dir.create("tmpfolder2")
set.seed(123)
m1 <- matrix(rnorm(100), nrow=10)
m2 <- matrix(rnorm(100), nrow=10)
cor.test(m1, m2)$estimate
#>         cor 
#> -0.04953215

write.csv(m1, "tmpfolder/f1.csv", row.names = FALSE)
write.csv(m2, "tmpfolder2/f1.csv", row.names = FALSE)

# since names are identical one list of names will suffice
f.names <- list.files("tmpfolder/")

# now apply the function to each file name
lapply(f.names, function(n){get.cor(n, path1 = "tmpfolder/", path2 = "tmpfolder2/")})
#> [[1]]
#>         cor 
#> -0.04953215
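
To get the data frame the question asks for, the per-file estimates can be collected with `vapply` rather than `lapply`. This is only a sketch under assumptions: the temporary directories, the file names `a.csv` and `b.csv`, and the helper name `get_cor` are made up for illustration and are not part of the answer.

```r
# Hypothetical toy directories under tempdir(); replace with the real paths
dir1 <- file.path(tempdir(), "corr_dir1")
dir2 <- file.path(tempdir(), "corr_dir2")
dir.create(dir1, showWarnings = FALSE)
dir.create(dir2, showWarnings = FALSE)

# Write identically named toy files into both directories
set.seed(123)
for (nm in c("a.csv", "b.csv")) {
  write.csv(matrix(rnorm(100), nrow = 10), file.path(dir1, nm), row.names = FALSE)
  write.csv(matrix(rnorm(100), nrow = 10), file.path(dir2, nm), row.names = FALSE)
}

# Correlation estimate for one file name that exists in both directories
get_cor <- function(name, path1, path2) {
  m1 <- as.matrix(read.csv(file.path(path1, name)))
  m2 <- as.matrix(read.csv(file.path(path2, name)))
  unname(cor.test(m1, m2)$estimate)
}

f_names <- list.files(dir1, pattern = "\\.csv$")

# vapply returns a plain numeric vector, which fits directly into a data frame
result <- data.frame(
  file = f_names,
  estimate = vapply(f_names, get_cor, numeric(1),
                    path1 = dir1, path2 = dir2, USE.NAMES = FALSE)
)
print(result)
```

Each row of `result` then holds one file name and the estimate for that pair of files.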

Thank you, this is very close - I think instead of read.csv it'd have to be list.files to match a pattern, such as only importing files that match *abc.csv? I've updated my question to reflect the need for the pattern match. — amurphy, Apr 16 '18 at 14:25
If you just edit the list.files line to read `list.files("tmpfolder/", "*abc.csv")`, it should work. The general principle is: get your single list of file names and your two paths to the folders, and then you can iterate along the list of names. — gfgm, Apr 16 '18 at 14:28
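
The name-matching step described in the comment above could look roughly like this. Note that the `pattern` argument of `list.files` is a regular expression, not a shell glob, so this sketch uses `"abc\\.csv$"`. The `intersect()` call is an added safeguard that is not in the answer, covering the case where a file exists in only one directory. All directory and file names are placeholders.

```r
# Placeholder directories under tempdir()
path1 <- file.path(tempdir(), "match_dir1")
path2 <- file.path(tempdir(), "match_dir2")
dir.create(path1, showWarnings = FALSE)
dir.create(path2, showWarnings = FALSE)

set.seed(1)
toy <- function() matrix(rnorm(20), nrow = 5)
# A pair that should be matched
write.csv(toy(), file.path(path1, "run1_abc.csv"), row.names = FALSE)
write.csv(toy(), file.path(path2, "run1_abc.csv"), row.names = FALSE)
# Files that should be skipped: wrong suffix, or present in only one directory
write.csv(toy(), file.path(path1, "run1_xyz.csv"), row.names = FALSE)
write.csv(toy(), file.path(path1, "run2_abc.csv"), row.names = FALSE)

# Regular expression: file names ending in "abc.csv"
pattern <- "abc\\.csv$"

# Keep only the names found in both directories
common <- intersect(list.files(path1, pattern = pattern),
                    list.files(path2, pattern = pattern))
print(common)

# Full paths for each side, ready to pass to read.csv
paths1 <- file.path(path1, common)
paths2 <- file.path(path2, common)
```

Iterating over `common` instead of the listing of a single directory avoids a read error when the two folders are not perfectly in sync.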

Felipe Alvarenga · Answer 2 · 2018-04-16T14:34:54.410

I would first read all files as matrices, then get all correlations using mapply, which is faster and neater.

# list the file paths; full.names = TRUE keeps the directory so read.csv can find the files
f1 = list.files("path1", pattern = "\\.csv$", full.names = TRUE)
f2 = list.files("path2", pattern = "\\.csv$", full.names = TRUE)

# order the files so they match each other in both lists
f1 = sort(f1)
f2 = sort(f2)

# load them as matrices
f11 = lapply(f1, function(x) as.matrix(read.csv(x)))
f22 = lapply(f2, function(x) as.matrix(read.csv(x)))

# generate the correlations
cor_tests = mapply(cor.test, f11, f22)

An example with dummy data

f1 = list(rnorm(100), rnorm(100))
f2 = list(2*rnorm(100), 2*rnorm(100))

ab = mapply(cor.test, f1, f2)
ab[rownames(ab) == "estimate"]
[[1]]
       cor 
-0.1024785 

[[2]]
      cor 
0.1020779
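
Since the question asks for a data frame, the estimates can also be collected from `mapply` directly. A sketch using the same kind of dummy data: `set.seed` is added only to make the run repeatable, and the column names are arbitrary choices.

```r
set.seed(42)
f1 <- list(rnorm(100), rnorm(100))
f2 <- list(2 * rnorm(100), 2 * rnorm(100))

# SIMPLIFY = FALSE keeps each cor.test result as a complete "htest" object
tests <- mapply(cor.test, f1, f2, SIMPLIFY = FALSE)

# One row per pair, with the estimate and its p-value
estimates <- data.frame(
  pair = seq_along(tests),
  estimate = vapply(tests, function(t) unname(t$estimate), numeric(1)),
  p.value = vapply(tests, function(t) t$p.value, numeric(1))
)
print(estimates)
```

Keeping the full test objects in a list makes it easy to add further columns later, such as the confidence interval bounds.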
