
I've implemented a solution that uses Quartz to read a folder at a fixed interval; for each file it performs some operations and deletes the file when it finishes. It runs smoothly when I don't have thousands of files in the directory.

getFiles(config.getString("input")) match {
  case Some(files) =>
    files.foreach { file =>
      try {
        // renameTo onto itself fails while another process holds the file,
        // so this doubles as an "in use" check
        if (file.renameTo(file)) {
          process(file, config)
        }
      } catch {
        case e: Exception => // swallowed silently; worth at least logging
      } finally {
        ...
      }
    }
  case None =>
    ...
}

def getFiles(path: String): Option[Array[File]] = {
  new File(path).listFiles() match {
    case files if files != null =>
      // only pick up files untouched for at least 5 seconds
      Some(files.filter(_.lastModified < System.currentTimeMillis - 5000))
    case _ =>
      None
  }
}

def process(file: File, clientConfig: Config): Unit = {
  ...
  file.delete()
}

Now my scenario is different - I'm working with thousands and thousands of files - and my throughput is very slow: 50 files/sec (each file is about 40 KB).

I was wondering what the best approach is to process many files. Should I change getFiles() to return N elements at a time and apply a FileLock to each one? With a FileLock I could retrieve only the files that are not in use. Or should I use something from Java NIO?

Thanks in advance.

  • Add some metrics to understand which functions take most of the time. I suppose `process` is your bottleneck. Probably multithreaded file processing can help. – vvg Nov 12 '15 at 10:36
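As a starting point for those metrics, here is a minimal timing wrapper (a sketch, not from the original thread; the `Timing` object and `timed` helper are mine) that can show whether the directory listing, the in-use check, or `process` dominates:

```scala
object Timing {
  // Run a block, print how long it took under the given label, and
  // return the block's result unchanged.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    try body
    finally println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  }
}
```

Wrapping each phase, e.g. `Timing.timed("process")(process(file, config))`, quickly reveals where the time goes.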

1 Answer


I think you can wrap your try/catch block in a Future, so you can process the files in parallel. An ExecutionContext backed by a cached thread pool is generally a good fit for IO-bound operations. This would also mean you do not need to worry about locks, because the futures are created sequentially from a single thread, so each file is handed to exactly one future.
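A minimal sketch of that idea (the names `ParallelFiles` and `processAll` are mine, and `process` here is a stand-in for the asker's `process(file, config)`):

```scala
import java.io.File
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object ParallelFiles {
  // Cached thread pool: grows for bursts of IO-bound tasks and
  // reclaims idle threads, unlike a fixed-size pool.
  val ioPool = Executors.newCachedThreadPool()
  implicit val ioEc: ExecutionContext = ExecutionContext.fromExecutorService(ioPool)

  // Stand-in for the asker's process(file, config).
  def process(file: File): Unit = {
    // ... real per-file work here ...
    file.delete()
  }

  // One Future per file; Future.sequence lets the caller wait on the batch.
  def processAll(files: Seq[File]): Future[Seq[Unit]] =
    Future.sequence(files.map { file =>
      Future(process(file)).recover {
        case e: Exception => () // log and continue; one bad file should not abort the batch
      }
    })
}
```

The `recover` keeps the per-file error isolation the original try/catch provided.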

You could also read the directory as a stream, so your code no longer holds references to all the files in memory up front, only to the current working set (one file at a time), but I don't think this is the cause of your bottleneck.
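For the streaming idea, Java NIO's `DirectoryStream` iterates directory entries lazily instead of materialising the whole listing the way `File#listFiles` does. A sketch (the helper name `eachStableFile` is hypothetical), keeping the original "untouched for 5 seconds" filter:

```scala
import java.nio.file.{Files, Path, Paths}
import scala.jdk.CollectionConverters._

object StreamedListing {
  // Visit each file in `dir` that has not been modified for `quietMillis`,
  // without loading the whole directory listing into memory first.
  def eachStableFile(dir: String, quietMillis: Long = 5000)(f: Path => Unit): Unit = {
    val stream = Files.newDirectoryStream(Paths.get(dir))
    try {
      stream.asScala.foreach { path =>
        // same "settled for 5 seconds" filter as the original getFiles
        if (Files.getLastModifiedTime(path).toMillis < System.currentTimeMillis - quietMillis)
          f(path)
      }
    } finally stream.close()
  }
}
```

With hundreds of thousands of entries this keeps the heap flat, though as noted above it is unlikely to fix the throughput on its own.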

– rahilb