
I am trying to obtain the list of files matching a full-path pattern. So far, I have used list.files() but it did not work.

Let's assume that we have the following directory organization:

results
   |- A
   |  |- data-1.csv
   |  |- data-2.csv
   |
   |- B
      |- data-1.csv
      |- data-2.csv

Then the following command:

list.files(pattern='data-.*\\.csv', recursive=TRUE)

will return all the files matching the pattern. This works, but the problem appears when using a full-path pattern. For instance, if I want to obtain all the CSV files from directory results/A, I could do:

list.files(pattern='results/A/data-.*\\.csv', recursive=TRUE)

This does not work, though. It seems that R does not apply the regular expression to the full path. In this case, the solution could be to just use results/A as the base path, but in more complex cases that is not possible. For instance, at some point we may want to match only the subdirectories whose names consist solely of uppercase letters:

list.files(pattern='results/[A-Z]+/data-.*\\.csv', recursive=TRUE)

Is it possible to do this in R?

UPDATE: After using ad hoc solutions for a while, I decided to stop typing the same thing again and again, so I created a library to simplify this task.

betabandido

4 Answers


First, note that you are not using regular expression patterns. Your first example should be:

list.files(pattern='data-.*\\.csv', recursive=TRUE)

Then, it seems the pattern matching inside list.files is applied only to the file basenames (i.e., not including the directory path), so you could split the task into two steps:

  1. Find all files matching the basename only, return their full paths:

    basename.matches <- list.files(pattern='data-.*\\.csv', recursive=TRUE,
                                   full.names = TRUE)
    basename.matches
    # [1] "./results/A/data-1.csv" "./results/A/data-2.csv" "./results/B/data-1.csv"
    # [4] "./results/B/data-2.csv"
    
  2. Keep only those that match the expected directory(ies):

    full.matches <- grep(pattern='^\\./results/A/', basename.matches, value = TRUE)
    full.matches
    # [1] "./results/A/data-1.csv" "./results/A/data-2.csv"
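
The two steps above can also be folded into one function. The following is only a sketch: list_files_by_path, its arguments, and the temporary demo directory (including the extra run-01 folder) are hypothetical names introduced for illustration, not part of base R. It lists paths relative to a root directory, so the same idea should also cover an absolute root such as /tmp, and then filters them with a regular expression anchored to the whole relative path.

```r
# Hypothetical helper (not part of base R): return the files under `root`
# whose path relative to `root` matches the regular expression `path_pattern`.
list_files_by_path <- function(path_pattern, root = ".") {
  # With full.names = FALSE (the default) the results are relative to `root`,
  # e.g. "results/A/data-1.csv", without a leading "./".
  rel <- list.files(root, recursive = TRUE)
  # Anchor the pattern so that it has to match the entire relative path.
  hits <- grep(paste0("^", path_pattern, "$"), rel, value = TRUE)
  file.path(root, hits)
}

# Demo on a throwaway directory tree (assumed layout, mirroring the question).
root <- tempfile("demo")
for (d in c("A", "B", "run-01")) {
  dir.create(file.path(root, "results", d), recursive = TRUE)
  file.create(file.path(root, "results", d, c("data-1.csv", "data-2.csv")))
}

matched <- list_files_by_path("results/[A-Z]+/data-.*\\.csv", root)
# Parent directory of each match; run-01 should not appear here.
matched_dirs <- sort(unique(basename(dirname(matched))))
print(matched_dirs)
```

Because the filtering happens on the relative path, the directory part and the file-name part of the pattern live in a single regular expression, which is what the question asks for.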
    
flodel
  • You are totally right. Thank you for spotting that. Your two-step solution is similar to what I was planning to do if, as it seems it is the case, there is no support for full-path patterns in R. However, if the regular expression points to an absolute path, list.files will not work. For instance, '/tmp/[A-Z]+/data-.*\\.csv'. I guess I can always extract the beginning of the path '/tmp/' and use that as the 'path' parameter for list.files, but I was wondering if R already provides something like that. – betabandido Apr 27 '12 at 16:10

You cannot do this with only list.files because it loops over each element in path and applies the regular expression to the files contained therein. But since the path argument to list.files can accept a vector, you can use that to solve your problem.

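# Keep only the subdirectories of "results" whose name consists solely of
# uppercase letters; the leading "/" anchors the match to the directory name.
dirs <- grep("/[A-Z]+$", list.dirs("results", recursive=FALSE), value=TRUE)
list.files(dirs, "data-.*\\.csv", recursive=TRUE, full.names=TRUE)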
Joshua Ulrich

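I think there is an even simpler solution, using a glob (wildcard) pattern instead of a regular expression. Note that in a glob, [A-Z] matches exactly one character:

Sys.glob(file.path("results", "[A-Z]", "data-*.csv"))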
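
Since globs have no equivalent of the regular-expression "+" quantifier, directory names longer than one letter need an extra step. A minimal sketch of one way to do that, assuming a temporary demo tree with hypothetical directories A, AB and run-01: glob with "*" for the directory level, then filter on the parent directory name with a regular expression.

```r
# Throwaway demo tree (assumed layout, for illustration only).
root <- tempfile("demo")
for (d in c("A", "AB", "run-01")) {
  dir.create(file.path(root, "results", d), recursive = TRUE)
  file.create(file.path(root, "results", d, c("data-1.csv", "data-2.csv")))
}

# "[A-Z]" is a glob character class: it stands for exactly one character,
# so the files under "AB" are not picked up here.
one_char <- Sys.glob(file.path(root, "results", "[A-Z]", "data-*.csv"))

# "*" accepts any directory name; a regular expression on the parent
# directory name then narrows the result to letters-only directories.
any_dir <- Sys.glob(file.path(root, "results", "*", "data-*.csv"))
letters_only <- any_dir[grepl("^[A-Z]+$", basename(dirname(any_dir)))]

print(length(one_char))
print(length(letters_only))
```

The glob keeps the call short, while the regular expression handles the part that a glob cannot express.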

michael

I would use glob2rx() to translate a wildcard pattern into a regular expression:

paths <- list.files("results", pattern = glob2rx("data-*.csv"), full.names = TRUE, recursive = TRUE)
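
To see what list.files actually receives, it can help to look at the translated pattern. This is a small sketch; the file names in the check are made up for illustration.

```r
# glob2rx() (from the utils package) turns a wildcard pattern into an
# anchored regular expression.
rx <- utils::glob2rx("data-*.csv")
print(rx)

# Hypothetical file names, used only to show which ones the regex accepts.
candidates <- c("data-1.csv", "mydata-1.csv", "data-1.csv.bak")
accepted <- grepl(rx, candidates)
print(accepted)
```

The translated pattern is still matched against file basenames only, so this approach selects by file name and not by directory.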
mga302