3

I have over 10,000 CSV files and I need to run a Fast Fourier Transform (FFT) on each column of each file. I have access to 1,000 cores. What is the fastest way to do this?

Currently I have a for loop that reads each file sequentially and calls `apply(data, 2, fft)`. I tried `clusterApply(cl, 1:10000, transformation)`, where the transformation function reads the CSV itself, but the reading still takes a long time. Does anyone know a faster way?
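
A simplified sketch of what I am doing now (file handling trimmed down):

#Sequential baseline: read each file, then fft each column
csv.list <- list.files(pattern = "\\.csv$")
results <- vector("list", length(csv.list))
for (i in seq_along(csv.list)) {
    data <- read.csv(csv.list[i])
    results[[i]] <- apply(data, 2, fft)
}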

jay.sf
  • This depends more on your disk storage system/setup than R itself, and all the CPUs in the world won't help if your data has to be accessed serially. Is your data distributed on a cluster _storage_ system? – mmuurr Nov 21 '14 at 00:03
  • Instead of having each thread read a file, have you tried processing the columns within a file in parallel? It could be as simple as `mclapply(data, fft)` (see the sketch after these comments). – Neal Fultz Nov 21 '14 at 00:28
  • `fread` in data.table is much faster than `read.csv`. – baptiste Nov 21 '14 at 00:30
  • My data is on a high-performance computer, so I hope it is not being accessed serially. How would I know whether it is? – user3554354 Nov 21 '14 at 00:43
  • @NealFultz, can you elaborate a little more so I understand exactly what you are saying? I am currently using 15 workers. Each worker reads a file and then applies `fft` to each column. At the end, the 15 workers join their results, and I move on to the next set of 15 files. This is done sequentially (every 15 files). Can this be avoided? – user3554354 Nov 21 '14 at 00:45
  • You should post working code for parallel applies on your machine. Then people can suggest improvements if they exist. I do think you need to investigate the `fread` suggestion. There is no reason I can think of that would prevent its distribution. – IRTFM Nov 21 '14 at 01:31
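
For illustration, a minimal sketch of the column-level parallelism Neal Fultz suggests (the file name is a placeholder): a data.frame is a list of columns, so `mclapply` can fan a single file's columns out across cores.

#Parallelize over the columns of one file
library(parallel)
data <- read.csv("one_file.csv") #placeholder file name
per.column <- mclapply(data, fft, mc.cores = 4)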

1 Answer

4

I would think that the fastest way would be to combine `mclapply` and `fread`.

#Bring in libraries
library(parallel)
library(data.table)

#Find all CSV files in the working directory
csv.list <- list.files(pattern = "\\.csv$")

#Create function to read in data and perform fft on each column
read.fft <- function(x) {
    data <- fread(x)
    result <- data[, lapply(.SD, fft)]
    return(result)
}

#Apply function across files using multiple cores
all.results <- mclapply(csv.list, read.fft, mc.cores = 10)
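
Note that `mclapply` forks on a single machine, so it can only use the cores of one node. If your 1,000 cores are spread across several machines, a socket cluster from the same `parallel` package is one route; a minimal sketch, assuming placeholder host names:

#PSOCK cluster across nodes ("node1"/"node2" are placeholders)
cl <- makeCluster(c("node1", "node2"))
clusterEvalQ(cl, library(data.table)) #load data.table on each worker
clusterExport(cl, "read.fft") #ship the function to the workers
all.results <- parLapply(cl, csv.list, read.fft)
stopCluster(cl)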

If it makes sense for you to take a random sample of each dataset, I would highly suggest changing the read.fft function to pipe the file through the Unix `shuf` command. It will speed up your read-in time by quite a bit.

#Create function to read in a random sample and perform fft
read.fft <- function(x) {
    data <- fread(paste0("shuf -n 10000 ", x)) #Takes a random sample of 10,000 rows
    result <- data[, lapply(.SD, fft)]
    return(result)
}
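
One caveat: newer versions of data.table expect shell commands to be passed through fread's `cmd` argument rather than embedded in the input string, so on a current install the sampling version would look like this:

#Same sampling function, using the current data.table API
read.fft <- function(x) {
    data <- fread(cmd = paste("shuf -n 10000", x)) #Takes a random sample of 10,000 rows
    result <- data[, lapply(.SD, fft)]
    return(result)
}
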
Mike.Gahan