21

I have a data.frame that contains a text column of file names. I would like to return the file name without the path or the file extension. Typically, my file names have been numbered, but they don't have to be. For example:

df<-data.frame(data=c("a","b"),fileNames=c("C:/a/bb/ccc/NAME1.ext","C:/a/bb/ccc/d D2/name2.ext"))

I would like to return the equivalent of

df<-data.frame(data=c("a","b"),fileNames=c("NAME","name"))

but I cannot figure out the slick regular expression to do this with gsub. For example, I can get rid of the extension with (provided the file name ends with a number):

gsub('([0-9]).ext','',df[,"fileNames"])

Though I've tried various patterns (after reading the regex help files and similar solutions on this site), I can't get a regex to return the text between the last "/" and the first ".". Any thoughts or pointers to similar questions are much appreciated!

The best I have gotten is:

 gsub('*[[:graph:]_]/|*[[:graph:]_].ext','',df[,"fileNames"])

But this 1) doesn't get rid of all the leading path characters and 2) depends on a specific file extension.

Docuemada

2 Answers

40

Perhaps this will get you closer to your solution:

library(tools)
basename(file_path_sans_ext(df$fileNames))
# [1] "NAME1" "name2"

The file_path_sans_ext function is from the "tools" package, which ships with the standard R distribution; it returns the path up to (but not including) the extension. The basename function then strips the directory part of the path.
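To make the two steps visible, here is a small sketch that runs them separately on the paths from the question. The variable names `paths` and `no_ext` are only for illustration.

```r
library(tools)

# The two example paths from the question
paths <- c("C:/a/bb/ccc/NAME1.ext", "C:/a/bb/ccc/d D2/name2.ext")

# Step 1: drop the extension, keeping the directory part
no_ext <- file_path_sans_ext(paths)
no_ext
# [1] "C:/a/bb/ccc/NAME1"      "C:/a/bb/ccc/d D2/name2"

# Step 2: drop the directory part
basename(no_ext)
# [1] "NAME1" "name2"
```

The order of the two calls should not matter for paths like these; `file_path_sans_ext(basename(paths))` gives the same result.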

Or, to take from file_path_sans_ext and modify it a bit, you can try:

sub("(.*\\/)([^.]+)(\\.[[:alnum:]]+$)", "\\2", df$fileNames)
# [1] "NAME1" "name2"

Here, I've "captured" all three parts of each value in the "fileNames" column, so if you wanted just the file paths, you would change "\\2" to "\\1", and if you wanted just the file extensions, you would change it to "\\3".
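As a rough illustration of the three capture groups, the sketch below pulls each part out of the example paths with the same pattern. The names `pattern`, `dirs`, `stems`, and `exts` are made up for this example.

```r
paths <- c("C:/a/bb/ccc/NAME1.ext", "C:/a/bb/ccc/d D2/name2.ext")

# Group 1: everything up to the last "/"; group 2: the name; group 3: the extension
pattern <- "(.*\\/)([^.]+)(\\.[[:alnum:]]+$)"

dirs  <- sub(pattern, "\\1", paths)
stems <- sub(pattern, "\\2", paths)
exts  <- sub(pattern, "\\3", paths)

dirs
# [1] "C:/a/bb/ccc/"      "C:/a/bb/ccc/d D2/"
stems
# [1] "NAME1" "name2"
exts
# [1] ".ext" ".ext"

# Caveat: the pattern needs both a "/" and an extension to match.
# A bare file name has no "/", so sub() returns it unchanged.
sub(pattern, "\\2", "NAME1.ext")
# [1] "NAME1.ext"
```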

A5C1D2H2I1M1N2O1R2T1
  • Interesting approach. For me, this approach is more clear than the regex, which is currently kind of confusing for me. I'll give it a try. – Docuemada Feb 25 '13 at 18:52
  • This worked well, thank you. It makes more sense to me, but that's probably because I need more practice with regex! – Docuemada Feb 25 '13 at 19:10
  • @Docuemada, no problem. As shown, `file_path_sans_ext` is a basic regular expression, as I suspect `basename` is (but haven't checked to verify). – A5C1D2H2I1M1N2O1R2T1 Feb 25 '13 at 19:13
  • Yes! the sub("(.*\\/)([^.]+)(\\.[[:alnum:]]+$)", "\\2", df$fileNames) was what I was after. Thanks to you and zipfzapf for the quick and informative responses. – Docuemada Feb 25 '13 at 19:52
11

First of all, to get rid of the "leading path", you can use basename. To remove the extension, you can use sub similar to your description in your question:

filenames <- sub("\\.[[:alnum:]]+$", "", basename(as.character(df$fileNames)))

Note that you should use sub instead of gsub here, because the file extension can occur only once in each filename. Also, you should use \\., which matches a literal dot, instead of ., which matches any single character. Finally, you should append $ to the pattern so that the extension is removed only when it is at the end of the filename.
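The three points can be seen on a single file name. The sketch below uses a hypothetical name, "text_notes.ext", chosen so that a loose pattern misfires; the variable names are illustrative only.

```r
# Hypothetical file name in which "ext" also appears inside the name itself
x <- "text_notes.ext"

# Unescaped dot, no anchor: "." matches the leading "t", so "text" is removed
loose_sub <- sub(".ext", "", x)
loose_sub
# [1] "_notes.ext"

# gsub with the same loose pattern removes every match, not just the extension
loose_gsub <- gsub(".ext", "", x)
loose_gsub
# [1] "_notes"

# Escaped dot plus "$": only the trailing extension is removed
strict <- sub("\\.[[:alnum:]]+$", "", x)
strict
# [1] "text_notes"
```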

Edit: the function file_path_sans_ext suggested in Ananda Mahto's solution works via sub("([^.]+)\\.[[:alnum:]]+$", "\\1", x), i.e. instead of removing the extension as above, it retains the non-extension part of the filename. I can't see any specific advantage or disadvantage to either method in the OP's case.
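For comparison, here is a minimal sketch that applies both regular expressions quoted above to the same input. The helper names `strip_ext` and `keep_stem` are hypothetical, and the ".bashrc" entry is an added edge case that is not part of the OP's data.

```r
# Remove a trailing extension (the pattern used in this answer)
strip_ext <- function(x) sub("\\.[[:alnum:]]+$", "", x)

# Keep the part before the extension (the pattern quoted for file_path_sans_ext)
keep_stem <- function(x) sub("([^.]+)\\.[[:alnum:]]+$", "\\1", x)

x <- c("NAME1.ext", "name2.ext", ".bashrc")

strip_ext(x)
# [1] "NAME1" "name2" ""
keep_stem(x)
# [1] "NAME1"   "name2"   ".bashrc"
```

Both agree on names like the OP's. They differ only when the name starts with a dot and has no other dot, because the second pattern requires at least one non-dot character before the extension.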

QkuCeHBH
  • You probably need to use an `as.character` around `df$fileNames` if it has been read in as a factor, as in the example data provided. – A5C1D2H2I1M1N2O1R2T1 Feb 25 '13 at 18:40
  • Thank you, and thank you for explaining the regexp characters. This works well. For this example, I used ...as.character(df$fileNames). – Docuemada Feb 25 '13 at 19:04