
I'm trying to get some files with a very simple regular expression using list.files.

files <- list.files('C:/filepath/...', pattern = ('split*.csv'), full.names=TRUE)

I have some files in that specific folder:

split1.csv
split2.csv
split3.csv
...

This code should work, according to a lot of examples I have seen. But when I run it with pattern = ('split*.csv'), I get an empty 'list' back.

However, when I run it with pattern = ('split1.csv'), it matches the file split1.csv.

When I run it with pattern = ('*.csv'), it also works fine: it matches all split files, but of course also the other csv files in the folder.

So the problem is not that the files are missing from this folder. The file path is correct, but pattern = ('split*.csv') does not match the split files above, even though the many examples I have seen suggest it should.

Could it be true that something has changed about this function? Does anyone know how to filter for the right files?

Working with R Version 3.6.1.

Hart Radev
jordinec
  • `'split.*\\.csv'` In regular expressions `.` is a special symbol. see https://stackoverflow.com/questions/4736/learning-regular-expressions – jogo Nov 08 '19 at 09:19
  • Possible duplicate https://stackoverflow.com/questions/27721008/how-do-i-deal-with-special-characters-like-in-my-regex – Ronak Shah Nov 08 '19 at 09:32

3 Answers

2

You are not using the correct regex. In a regular expression, . is a placeholder that matches any single character. To match an actual dot, you should use \\. instead.

list.files('C:/filepath/...', pattern = 'split.*\\.csv$', full.names=TRUE)

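* in regex language means zero or more occurrences of the preceding element (here the t), and the unescaped . matches any single character. Therefore, pattern = 'split*.csv' will match any of these: spli.csv, split.csv, splitttttt.csv, splittttttHcsv, spli9csv and so on. It does not match split1.csv, because the . consumes the 1 and the characters that follow are then .cs rather than csv.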

You can use grep to test your regex:

pos_match <- c("spli.csv", "split.csv", "splittttt.csv", "spli9csv", "splittttHcsv")
neg_match <- c("split1.csv",  "split2.csv")
grep("split*.csv", pos_match, value = TRUE)  # returns all five strings
grep("split*.csv", neg_match, value = TRUE)  # returns character(0)
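
The same comparison can be sketched end to end with list.files. This is only an illustration: the folder name split_demo and the file names (including other.csv and split10.csv) are made up for the demo, and the anchored pattern is one possible fix, not the only one. It also tries utils::glob2rx, which translates a wildcard such as split*.csv into a regular expression.

```r
# Build a throwaway folder with hypothetical file names for the demo.
demo_dir <- file.path(tempdir(), "split_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- c("split1.csv", "split2.csv", "split10.csv", "other.csv")
invisible(file.create(file.path(demo_dir, demo_files)))

# Original pattern: "spli", zero or more "t", any one character, then "csv".
glob_like <- list.files(demo_dir, pattern = "split*.csv")

# Corrected pattern: "split", any characters, a literal dot, "csv" at the end.
corrected <- list.files(demo_dir, pattern = "^split.*\\.csv$")

# glob2rx() converts the wildcard form into a regular expression.
from_glob <- list.files(demo_dir, pattern = utils::glob2rx("split*.csv"))

print(glob_like)        # no matches among these names
print(sort(corrected))  # the three split files, without other.csv
print(sort(from_glob))  # same files as the corrected pattern
```

If the wildcard form is what you had in mind from the examples you saw, wrapping it in glob2rx keeps the call close to the original.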
Cettt
1

It is due to your regex:

a <- c('split1.csv', 'split2.csv', 'split3.csv')
grep('split.*\\.csv', a)

In a regular expression, you need a dot (.) before the star (*) to match any number of characters: the . matches any single character and the * repeats it zero or more times.

And if you want to match an actual dot (like in '.csv'), or any other character that has a special meaning in regex, you have to say that you want that literal character by adding '\\' before it.
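
A small sketch of the difference the escape makes. The name split1Xcsv is a made-up string with no dot in it, used only to show what an unescaped . lets through; the [.] form is an alternative spelling of a literal dot, not something taken from the question.

```r
candidates <- c("split1.csv", "split1Xcsv")

# Unescaped dot: "." matches any single character, so both names match.
unescaped <- grepl("split.*.csv", candidates)

# Escaped dot: "\\." in an R string is the regex "\.", a literal dot.
escaped <- grepl("split.*\\.csv", candidates)

# A character class is another way to write a literal dot.
bracketed <- grepl("split.*[.]csv", candidates)

print(unescaped)  # TRUE TRUE
print(escaped)    # TRUE FALSE
print(bracketed)  # TRUE FALSE
```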

Hart Radev
0

When you are using split*, the * applies only to the preceding t, so it can mean splittttt, spli, split. Since your file names have numbers after split, I think split[0-9]*\\.csv should help (note the escaped dot before csv).
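
A stricter variant of this idea, as a sketch only: it adds ^ and $ anchors and uses + instead of *, on the assumption that at least one digit is always present. The test names other than split1.csv are hypothetical.

```r
names_to_check <- c("split1.csv", "split10.csv", "split.csv",
                    "splitA.csv", "my_split1.csv.bak")

# "^split", one or more digits, a literal dot, then "csv" at the very end.
strict <- "^split[0-9]+\\.csv$"
is_split_file <- grepl(strict, names_to_check)

print(is_split_file)  # TRUE TRUE FALSE FALSE FALSE
```

Swap + back to * if a bare split.csv should also count as a match.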

msian09