45

I'm trying to write a regex that will parse out the directory and filename of a fully qualified path using matching groups.

so...

/var/log/xyz/10032008.log

should yield "/var/log/xyz" as group 1 and "10032008.log" as group 2.

Seems simple but I can't get the matching groups to work for the life of me.

NOTE: As pointed out by some of the respondents this is probably not a good use of regular expressions. Generally I'd prefer to use the file API of the language I was using. What I'm actually trying to do is a little more complicated than this but would have been much more difficult to explain, so I chose a domain that everyone would be familiar with in order to most succinctly describe the root problem.

Mike Deck

9 Answers

53

Try this:

^(.+)\/([^\/]+)$

EDIT: escaped the forward slash to prevent problems when copy/pasting the Regex
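
A minimal Python sketch of how the two groups come out of this pattern. The helper name `split_path` is illustrative and not part of the original answer. Python does not use `/` as a regex delimiter, so the escaped slashes are harmless but not required there.

```python
import re

# Pattern from the answer above; escaping "/" is optional in Python.
PATTERN = re.compile(r"^(.+)\/([^\/]+)$")

def split_path(path):
    """Return (directory, filename), or None when the pattern does not match."""
    match = PATTERN.match(path)
    return match.groups() if match else None

print(split_path("/var/log/xyz/10032008.log"))  # ('/var/log/xyz', '10032008.log')
print(split_path("10032008.log"))               # None: there is no slash at all
print(split_path("/10032008.log"))              # None: (.+) needs a character before the last slash
```

The last two calls illustrate the limitations raised in the comments: a bare filename and a file directly under the root directory do not match.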

lbragile
Paige Ruten
  • 1
    Don't you want to make that non-greedy (if this anon regex can handle that) so that it doesn't have to backtrack all that way to the slash? – Axeman Oct 03 '08 at 21:52
  • 13
    This one assumes that there is a path and not just a filename. – Travis Illig Oct 03 '08 at 21:59
  • 5
    It also runs into problems with current directory (.) and root directory (/). The former isn't an issue (fully-qualified pathnames don't start with a dot); the latter might be. The regex also does not handle .. back-traversals - that might be OK because fully-qualified might mean no dot-dot bits. – Jonathan Leffler Oct 03 '08 at 22:48
  • This also works... r'.*/(.*)$', where the first capture group returns the filename. Since .* is greedy by default, it does all the work. Again, this assumes there is a path. – Paul Kenjora Dec 02 '17 at 02:34
  • 7
    `^(.+)\/([^\/]+)$` The forward slashes must be escaped? – Neil Agarwal Feb 27 '18 at 19:11
  • 1
    You need to escape the forward slashes, but otherwise this answer was just what I needed when trying to solve this question on Answers.Splunk.com - https://answers.splunk.com/answers/777810/how-to-get-a-record-count-of-a-file-under-some-pat.html#answer-776884 – warren Oct 17 '19 at 14:01
31

In languages that support regular expressions with non-capturing groups:

((?:[^/]*/)*)(.*)

I'll explain the gnarly regex by exploding it...

(
  (?:
    [^/]*
    /
  )
  *
)
(.*)

What the parts mean:

(  -- capture group 1 starts
  (?:  -- non-capturing group starts
    [^/]*  -- greedily match as many non-directory separators as possible
    /  -- match a single directory-separator character
  )  -- non-capturing group ends
  *  -- repeat the non-capturing group zero-or-more times
)  -- capture group 1 ends
(.*)  -- capture all remaining characters in group 2

Example

To test the regular expression, I used the following Perl script...

#!/usr/bin/perl -w

use strict;
use warnings;

sub test {
  my $str = shift;
  my $testname = shift;

  $str =~ m#((?:[^/]*/)*)(.*)#;

  print "$str -- $testname\n";
  print "  1: $1\n";
  print "  2: $2\n\n";
}

test('/var/log/xyz/10032008.log', 'absolute path');
test('var/log/xyz/10032008.log', 'relative path');
test('10032008.log', 'filename-only');
test('/10032008.log', 'file directly under root');

The output of the script...

/var/log/xyz/10032008.log -- absolute path
  1: /var/log/xyz/
  2: 10032008.log

var/log/xyz/10032008.log -- relative path
  1: var/log/xyz/
  2: 10032008.log

10032008.log -- filename-only
  1:
  2: 10032008.log

/10032008.log -- file directly under root
  1: /
  2: 10032008.log
Chad Nouis
12

Most languages have path parsing functions that will give you this already. If you have the ability, I'd recommend using what comes to you for free out-of-the-box.

Assuming / is the path delimiter...

^(.*/)([^/]*)$

The first group will be whatever the directory/path info is, the second will be the filename. For example:

  • /foo/bar/baz.log: "/foo/bar/" is the path, "baz.log" is the file
  • foo/bar.log: "foo/" is the path, "bar.log" is the file
  • /foo/bar: "/foo/" is the path, "bar" is the file
  • /foo/bar/: "/foo/bar/" is the path and there is no file.
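
The four cases above can be checked with a short Python sketch. The helper name `split_dir_file` and the extra fifth sample are illustrative additions, not part of the original answer.

```python
import re

# Group 1 keeps the trailing slash; group 2 may be empty.
PATTERN = re.compile(r"^(.*/)([^/]*)$")

def split_dir_file(path):
    """Return (path_part, file_part), or None when there is no slash to split on."""
    match = PATTERN.match(path)
    return match.groups() if match else None

for sample in ("/foo/bar/baz.log", "foo/bar.log", "/foo/bar", "/foo/bar/", "bar.log"):
    print(sample, "->", split_dir_file(sample))
```

The last sample, a bare filename, returns None because the first group requires at least one slash.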
Travis Illig
7

What language? And why use a regex for this simple task?

If you must:

^(.*)/([^/]*)$

gives you the two parts you wanted. You might need to escape the parentheses:

^\(.*\)/\([^/]*\)$

depending on your tool's regex syntax (for example, POSIX basic regular expressions, as used by sed and by grep without -E).

But I suggest you just use your language's string search function that finds the last "/" character, and split the string on that index.
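
A rough Python sketch of that suggestion, using `str.rfind` to locate the last slash. The function name `split_on_last_slash` is hypothetical, and returning an empty directory when no slash is found is an assumption of this sketch.

```python
def split_on_last_slash(path):
    """Split a path at its last "/" without using a regex."""
    index = path.rfind("/")  # -1 when the path contains no slash
    if index == -1:
        return "", path
    return path[:index], path[index + 1:]

print(split_on_last_slash("/var/log/xyz/10032008.log"))  # ('/var/log/xyz', '10032008.log')
print(split_on_last_slash("10032008.log"))               # ('', '10032008.log')
print(split_on_last_slash("/10032008.log"))              # ('', '10032008.log')
```

Note that a file directly under the root comes back with an empty directory rather than "/", so that case may need special handling.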

tzot
  • Many frameworks (e.g. .NET/Python) have methods for separating file names from paths without needing to manually search for the '/' character. This is great because the tools are typically platform-independent. – Jordan Parmer Oct 03 '08 at 21:46
  • 1
    Yes, but he hasn't specified language yet. If it was Python, I would suggest os.path.dirname and os.path.basename . – tzot Oct 04 '08 at 18:47
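
For Python specifically, the functions mentioned in the comment above look like this in use. The sketch imports `posixpath` (the module that `os.path` resolves to on POSIX systems) so that "/" is treated as the separator on any platform; that choice is an assumption made for the example.

```python
import posixpath

path = "/var/log/xyz/10032008.log"

directory = posixpath.dirname(path)   # everything before the last slash
filename = posixpath.basename(path)   # everything after the last slash
print(directory, filename)            # /var/log/xyz 10032008.log

# split() returns both parts at once and keeps "/" for files under the root.
print(posixpath.split("/10032008.log"))  # ('/', '10032008.log')
```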
6

Reasoning:

I did a little research by trial and error and found that, on a *nix machine, every character available on the keyboard can be used in a file or directory name except '/'.

I used the touch command to create a file for each of the following characters, and it succeeded every time.

(Comma separated values below)
'!', '@', '#', '$', "'", '%', '^', '&', '*', '(', ')', ' ', '"', '\', '-', ',', '[', ']', '{', '}', '`', '~', '>', '<', '=', '+', ';', ':', '|'

It failed only when I tried to create a file named '/' (because that is the root directory) or a filename containing / (because it is the path separator).

Running touch . did not create anything; it only updated the modification time of the current directory (.). A dot inside a name, as in file.log, is allowed.

And of course, a-z, A-Z, 0-9, - (hyphen), and _ (underscore) work.

Outcome

So, by the above reasoning we know that a file name or directory name can contain anything except / forward slash. So, our regex will be derived by what will not be present in the file name/directory name.

/(?:(?P<dir>(?:[/]?)(?:[^\/]+/)+)(?P<filename>[^/]+))/

Step by Step regexp creation process

Pattern Explanation

Step-1: Start with matching root directory

A path starts with / when it is absolute and with a directory name when it is relative. Hence, look for / with zero or one occurrence.

/(?P<filepath>(?P<root>[/]?)(?P<rest_of_the_path>.+))/


Step-2: Try to find the first directory.

Next, a directory and its child is always separated by /. And a directory name can be anything except /. Let's match /var/ first then.

/(?P<filepath>(?P<first_directory>(?P<root>[/]?)[^\/]+/)(?P<rest_of_the_path>.+))/


Step-3: Get full directory path for the file

Next, let's match all directories

/(?P<filepath>(?P<dir>(?P<root>[/]?)(?P<single_dir>[^\/]+/)+)(?P<rest_of_the_path>.+))/


Here, single_dir is xyz/ because it first matched var/, then the next occurrence of the same pattern, log/, and then the next one, xyz/. A repeated group reports only the last occurrence it matched.
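
The "last occurrence wins" behaviour of a repeated group can be observed in Python, which uses the same `(?P<name>...)` syntax. This is a sketch of the Step-3 pattern with the surrounding `/` delimiters dropped, since Python does not use them.

```python
import re

# Step-3 pattern; single_dir sits inside a repeated (+) group.
pattern = re.compile(
    r"(?P<filepath>(?P<dir>(?P<root>[/]?)(?P<single_dir>[^\/]+/)+)(?P<rest_of_the_path>.+))"
)

match = pattern.match("/var/log/xyz/10032008.log")
print(match.group("dir"))               # /var/log/xyz/
print(match.group("single_dir"))        # xyz/  (only the last repetition is kept)
print(match.group("rest_of_the_path"))  # 10032008.log
```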

Step-4: Match filename and clean up

We are never going to use groups like single_dir, filepath, and root, so let's clean them up.

Keep them as groups, but make them non-capturing.

And rest_of_the_path is just the filename, so rename it. A filename cannot contain /, so it is better to match it with [^/]+ instead of .+

/(?:(?P<dir>(?:[/]?)(?:[^\/]+/)+)(?P<filename>[^/]+))/

This brings us to the final result. Of course, there are several other ways you can do it. I am just mentioning one of the ways here.
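
A small Python sketch of the final pattern in use. The `/` delimiters are dropped, and `fullmatch` is used so the whole string has to match; both are assumptions of this example rather than part of the answer, and the helper name `parse_path` is hypothetical.

```python
import re

FINAL = re.compile(r"(?:(?P<dir>(?:[/]?)(?:[^\/]+/)+)(?P<filename>[^/]+))")

def parse_path(path):
    """Return {'dir': ..., 'filename': ...}, or None when the path does not match."""
    match = FINAL.fullmatch(path)
    return match.groupdict() if match else None

print(parse_path("/var/log/xyz/10032008.log"))
# {'dir': '/var/log/xyz/', 'filename': '10032008.log'}
print(parse_path("var/log/xyz/10032008.log"))
# {'dir': 'var/log/xyz/', 'filename': '10032008.log'}
print(parse_path("10032008.log"))  # None: the pattern requires at least one directory
```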


Regex Rules used above are listed here

^ means the match must start at the beginning of the string.
(?P<dir>pattern) captures a group under the given name. The final regex has two named groups, dir and filename.
(?:pattern) is a non-capturing group: it groups the pattern without capturing it.
? means match zero or one occurrence.
+ means match one or more occurrences.
[^\/] matches any character except a forward slash (/).

[/]? means that an absolute path may start with / while a relative path does not, so match zero or one occurrence of /.

[^\/]+/ means one or more characters that are not a forward slash (/), followed by a forward slash (/). This matches var/ or xyz/, one directory at a time.

theBuzzyCoder
  • a file/directory name in most (if not all) filesystems that originated in a *nix environment accept all byte values except '/' and '\0'. – tzot Jun 17 '20 at 15:53
2

What about this?

[/]{0,1}([^/]+[/])*([^/]*)

Deterministic :

((/)|())([^/]+/)*([^/]*)

Strict :

^[/]{0,1}([^/]+[/])*([^/]*)$
^((/)|())([^/]+/)*([^/]*)$
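
A Python sketch of what the first strict pattern captures. Because the directory group is repeated, it holds only the last directory component, not the whole path. The second pattern, `WHOLE_DIR`, is an illustrative variation (not from the answer) that wraps the repetition in an outer group to capture the full directory.

```python
import re

# First "strict" pattern from the answer, unchanged.
STRICT = re.compile(r"^[/]{0,1}([^/]+[/])*([^/]*)$")

# Hypothetical variation: the outer group captures the whole directory part.
WHOLE_DIR = re.compile(r"^(/?(?:[^/]+/)*)([^/]*)$")

path = "/var/log/xyz/10032008.log"
print(STRICT.match(path).groups())            # ('xyz/', '10032008.log')
print(WHOLE_DIR.match(path).groups())         # ('/var/log/xyz/', '10032008.log')
print(STRICT.match("10032008.log").groups())  # (None, '10032008.log')
```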
1

A very late answer, but hope this will help

^(.+?)/([\w]+\.log)$

This uses a lazy quantifier for the directory part. It is a small modification of the accepted answer that only matches .log filenames made of word characters.

http://regex101.com/r/gV2xB7/1
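
A short Python sketch of this pattern, including a name it rejects. The sample paths other than the one from the question are made up for illustration.

```python
import re

LOG_PATH = re.compile(r"^(.+?)/([\w]+\.log)$")

match = LOG_PATH.match("/var/log/xyz/10032008.log")
print(match.groups())  # ('/var/log/xyz', '10032008.log')

# \w does not include "-", and only the .log extension is accepted.
print(LOG_PATH.match("/var/log/xyz/my-app.log"))   # None
print(LOG_PATH.match("/var/log/xyz/10032008.txt")) # None
```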

0

Try this:

/^(\/([^/]+\/)*)(.*)$/

It will leave the trailing slash on the path, though.
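
In Python, with the `/` delimiters dropped, the pattern yields three groups: the directory with its trailing slash, the last directory component (from the repeated inner group), and the filename. This is only a sketch; the helper name `split_absolute` is illustrative.

```python
import re

PATTERN = re.compile(r"^(\/([^/]+\/)*)(.*)$")

def split_absolute(path):
    """Return (directory, last_component, filename), or None for non-absolute paths."""
    match = PATTERN.match(path)
    return match.groups() if match else None

print(split_absolute("/var/log/xyz/10032008.log"))  # ('/var/log/xyz/', 'xyz/', '10032008.log')
print(split_absolute("/10032008.log"))              # ('/', None, '10032008.log')
print(split_absolute("var/log/10032008.log"))       # None: a leading slash is required
```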

Lucas Oman
0

Given an example upload folder URL:

https://drive.google.com/drive/folders/14Q6d-KiwgTKE-qm5EOZvHeX86-Wf9Q5f?usp=sharing

The regular expression pattern is:

[-\w]{25,}   

This pattern also works in Google Sheets as well as custom functions in Excel:

=REGEXEXTRACT(N2,"[-\w]{25,}")

The result is: 14Q6d-KiwgTKE-qm5EOZvHeX86-Wf9Q5f
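
The same extraction can be sketched in Python with `re.search`. The variable names are illustrative, and the pattern simply picks the first run of 25 or more word characters or hyphens, which is the folder ID in this example URL.

```python
import re

DRIVE_ID = re.compile(r"[-\w]{25,}")

url = "https://drive.google.com/drive/folders/14Q6d-KiwgTKE-qm5EOZvHeX86-Wf9Q5f?usp=sharing"

match = DRIVE_ID.search(url)
folder_id = match.group(0) if match else None
print(folder_id)  # 14Q6d-KiwgTKE-qm5EOZvHeX86-Wf9Q5f
```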
