I am trying to parse URLs and file paths from files using Python. I already have a URL regex.

Issue

I want a regex pattern that extracts file paths from a string. Requirements:

  • exclusive (does not match URLs)
  • OS-independent, i.e. handles both Windows- and UNIX-style paths, e.g. C:\, \\, /
  • all path types, i.e. absolute and relative paths, e.g. /, ../

Please assist by modifying my attempt below or suggesting an improved pattern.

Attempt

Here is the regex I have so far:

(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+)[\w+\\\s_\(\)\/]+(?:\.\w+)*

Description

  • (?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+): a leading uppercase drive letter, a backslash, or one or more ./ or ../ segments (with either kind of slash)
  • [\w+\\\s_\(\)\/]+: any path-like characters: word characters (alphanumerics and underscores), a literal +, slashes, whitespace, parens
  • (?:\.\w+)*: zero or more extensions

Result

[Screenshot of the regex match results; image not available]

Note: I have confirmed these results in Python using an input list of strings and the re module.
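
A check of this kind can be sketched as follows. The sample strings and the helper name first_match are illustrative assumptions, not the original test set; only the pattern itself is taken from the attempt above.

```python
import re

# The attempt from the question, unchanged.
ATTEMPT = re.compile(r'(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+)[\w+\\\s_\(\)\/]+(?:\.\w+)*')

# Hypothetical sample inputs (not the original test list).
SAMPLES = [
    r'C:\foo\bar.txt',                     # Windows absolute path
    r'\\server\share\file.txt',            # UNC path
    '../foo/bar.txt',                      # relative path
    '/foo/bar.txt',                        # UNIX absolute path
    'https://www.example.com/index.html',  # URL
]


def first_match(text):
    """Return the first substring matched by ATTEMPT, or None."""
    match = ATTEMPT.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), "->", repr(first_match(sample)))
```

With these samples, the attempt finds the Windows, UNC, and relative paths in full, and returns None for both /foo/bar.txt and the URL, which is the gap described under "Expected".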

Expected

This regex satisfies most of my requirements, namely excluding URLs while extracting most file paths. However, I would like to match all paths, including UNIX-style paths that begin with a single slash (e.g. /foo/bar.txt), without matching URLs.

Research

I have not found a general solution. Most work tends to satisfy specific cases.

pylang
  • You could match the preceding character if it is going to be portable. You shouldn't use non-capturing groups either. Try this https://regex101.com/r/IsmBeL/8 – revo Mar 04 '19 at 19:54
  • And check this for Python https://regex101.com/r/IsmBeL/10 – revo Mar 04 '19 at 19:59
  • Or perhaps add another alternation with a negative lookbehind to match the first 2 paths https://regex101.com/r/5Dyith/1 – The fourth bird Mar 04 '19 at 20:00
  • Well, this is going to be fun. `command.com` is literally a filename and an internet host. – melpomene Mar 04 '19 at 20:00
  • @melpomene It doesn't have a scheme? It's not a URL. – revo Mar 04 '19 at 20:04
  • @revo OP lists `www.google.com` under "URLs". – melpomene Mar 04 '19 at 20:05
  • To match that a **file** name is valid in UNIX you do this: `'\0' not in filename and filename[-1] != '/'`. The **only** limitation is that the filename cannot include `\0` and a file cannot contain `/` in its *name* (obviously its absolute path will contain `/`s). (I might add that using normal APIs you really cannot include `/` in the name part of the filename except as placing it at the end of the name... in other positions it will be interpreted as separator in the path). – Bakuriu Mar 04 '19 at 20:06
  • @melpomene But at least never listed `google.com`. – revo Mar 04 '19 at 20:06
  • @revo `mkdir https && touch https://www.example.com` – melpomene Mar 04 '19 at 20:07
  • @melpomene Sorry I didn't get your point. – revo Mar 04 '19 at 20:09
  • @revo: What melpomene and Bakuriu are saying is, `https://www.example.com` is a valid filename. For that matter, `is a valid filename` is a valid filename. There is no way to find "filenames" in text without testing virtually every substring against a filesystem for existence. – Amadan Mar 05 '19 at 05:47
  • @Amadan So they were trying to say `mkdir https: ...`. Well to me it is over-thinking and over-complicating things which usually happens. Which UNIX utility does output double slashes as part of the path? None. So this could be handled with no further sophistication. BTW, I agree with you. I never said a regex is able to sense filenames. – revo Mar 05 '19 at 06:12
  • @revo Yours works in many of my tests. Can you post an answer explaining your lookbehind? See my updated tests https://regex101.com/r/IsmBeL/26. Can you resolve the remaining issues? – pylang Mar 08 '19 at 04:17
  • @Thefourthbird Can you also post an answer? See my updated tests. https://regex101.com/r/5Dyith/2 – pylang Mar 08 '19 at 04:18
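
The two checks raised in the comments above can be sketched in Python: Bakuriu's validity test for a UNIX file name, and Amadan's point that only the filesystem can confirm whether a candidate string really is a path. The function names are hypothetical, and rejecting the empty string is an added assumption.

```python
import os


def is_valid_unix_filename(filename):
    """Bakuriu's check: no NUL character and no trailing slash.

    Rejecting the empty string is an extra assumption made here so that
    filename[-1] cannot raise IndexError.
    """
    return bool(filename) and "\0" not in filename and filename[-1] != "/"


def existing_paths(candidates):
    """Keep only the candidate strings that exist on the local filesystem."""
    return [candidate for candidate in candidates if os.path.exists(candidate)]


if __name__ == "__main__":
    # Both strings pass the validity check, which is the point of the comments.
    print(is_valid_unix_filename("https://www.example.com"))
    print(is_valid_unix_filename("is a valid filename"))
    print(existing_paths([os.getcwd(), "https://www.example.com"]))
```

A regex can therefore only propose candidates; a filter like existing_paths is one way to confirm them when the files are expected to exist on the machine doing the parsing.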

1 Answer

You could split the problem into three alternative patterns (note that I didn't implement all character exclusions for path/file names):

  • Non-quoted Windows paths
  • Quoted Windows paths
  • Unix paths

This would give something like this:

((((?<!\w)[A-Za-z]:)|(\.{1,2}\\))([^\b%\/\|:\n\"]*))|("(?:[A-Za-z]:|\.{1,2}\\)([^%\/\|:\n\"]*)")|((?<!\w)(\.{1,2})?(?<!\/)(\/((\\\b)|[^ \b%\|:\n\"\\\/])+)+\/?)

Broken down:

Wind-Non-Quoted: ((((?<!\w)[A-Za-z]:)|(\.{1,2}\\))([^\b%\/\|:\n\"]*))
Wind-Quoted:     ("(?:[A-Za-z]:|\.{1,2}\\)([^%\/\|:\n\"]*)")
Unix:            ((?<!\w)(\.{1,2})?(?<!\/)(\/((\\\b)|[^ \b%\|:\n\"\\\/])+)+\/?)


Wind-Non-Quoted:
    prefix: (((?<!\w)[A-Za-z]:)|(\.{1,2}\\))
       drive: ((?<!\w)[A-Za-z]:)   *Lookbehind to ensure a single letter*
    relative: (\.{1,2}\\)
      path: ([^\b%\/\|:\n\"]*)     *Excluding invalid name characters (the list is not complete)*

Wind-Quoted:
    prefix: (?:[A-Za-z]:|\.{1,2}\\)  *Same prefix pattern as non-quoted; written out in full, because a backreference like \2 matches previously captured text, not a pattern*
      path: ([^%\/\|:\n\"]*)         *Same as above, but without the \b exclusion (intended to allow spaces)*

Unix:
    prefix: (?<!\w)(\.{1,2})?                  optional . or .., not preceded by a word character
      path: (?<!\/)                            not preceded by /
            (\/((\\\b)|[^ \b%\|:\n\"\\\/])+)+  repeated /name (exclusions as above)
            \/?                                optionally ending with /

            *(excluding the double slashes is intended to prevent matching URLs)*
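
For use with Python's re module, the three-branch idea can be sketched as below. This is a simplified variant under stated assumptions, not the exact pattern above: it uses non-capturing groups so that each match is one cohesive string, writes the Windows prefix out in full in both Windows branches, uses \s to exclude whitespace (in Python, \b inside a character class denotes a backspace, not a word boundary), and omits the backslash-escape alternative in the Unix branch. The names WIN_UNQUOTED, WIN_QUOTED, UNIX, PATHS and extract_paths, as well as the sample text, are made up for illustration.

```python
import re

# Branch 1: unquoted Windows path. Drive letter (not preceded by a word
# character) or .\ / ..\ prefix, then any run of characters that are not
# whitespace, %, /, |, : or ".
WIN_UNQUOTED = r'(?:(?<!\w)[A-Za-z]:|\.{1,2}\\)[^\s%/|:"]*'

# Branch 2: quoted Windows path. Same prefix, but spaces are allowed
# inside the quotes.
WIN_QUOTED = r'"(?:[A-Za-z]:|\.{1,2}\\)[^%/|:\n"]*"'

# Branch 3: Unix path. Optional . or .., then one or more /name segments;
# the first slash must not follow a word character or another slash, which
# is what keeps the "//" of a URL from matching.
UNIX = r'(?<!\w)(?:\.{1,2})?(?<!/)(?:/[^\s%|:"\\/]+)+/?'

PATHS = re.compile('|'.join('(?:%s)' % branch
                            for branch in (WIN_UNQUOTED, WIN_QUOTED, UNIX)))


def extract_paths(text):
    """Return every substring of text matched by one of the three branches."""
    return [match.group(0) for match in PATHS.finditer(text)]


if __name__ == "__main__":
    sample = (r'Open C:\temp\a.txt then /etc/hosts then '
              r'"C:\Program Files\x.exe" then ../lib/util.py '
              r'see https://www.example.com/index.html')
    for path in extract_paths(sample):
        print(path)
```

On the sample text this yields the four paths (the quoted one including its quotes) and skips the URL. As with the pattern above, the excluded-character lists are incomplete, and trailing punctuation such as a comma directly after an unquoted path would be included in the match.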
Alain T.
  • I appreciate the work. Your approach does match a majority of my tests (https://regex101.com/r/qFDLwB/1/). However, most are doing multiple captures. Instead, I think you need multiple *non-capture* groups and one capture group to extract a cohesive file path. See alternative: https://regex101.com/r/IsmBeL/26. Also, are you able to resolve the remaining edge cases? – pylang Mar 08 '19 at 04:27