I am trying to parse urls and filepaths from files using Python. I already have a url regex.
Issue
I want a regex pattern that extracts file paths from a string. Requirements:
- exclusive (does not include urls)
- OS-independent, i.e. Windows and UNIX style paths, e.g. (C:\, \\, /)
- all path types, i.e. absolute & relative paths, e.g. (/, ../)
Please assist by modifying my attempt below or suggesting an improved pattern.
Attempt
Here is the regex I have so far:
(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+)[\w+\\\s_\(\)\/]+(?:\.\w+)*
Description
(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+): any preceding drive letter, backslash or dotted path
[\w+\\\s_\(\)\/]+: any path-like characters - alphanumerics, slashes, parens, underscores, ...
(?:\.\w+)*: optional extension
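For readability, the same three parts can be written out with re.VERBOSE so each piece carries its description as a comment. This is only a sketch of the pattern above, not a fix; the name PATH_RE_VERBOSE and the sample string are illustrative assumptions.

```python
import re

# Same pattern as in the Attempt section, split across lines with re.VERBOSE.
# PATH_RE_VERBOSE is a hypothetical name chosen for this sketch.
PATH_RE_VERBOSE = re.compile(
    r"""
    (?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+)   # drive letter, backslash, or dotted prefix
    [\w+\\\s_\(\)\/]+                  # path-like characters
    (?:\.\w+)*                         # optional extension(s)
    """,
    re.VERBOSE,
)

# Illustrative input only; the raw string keeps the backslashes literal.
match = PATH_RE_VERBOSE.search(r"open C:\temp\notes.txt")
print(match.group(0) if match else None)
```

Whitespace and comments are ignored in verbose mode, so this should behave the same as the one-line pattern.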
Result
Note: I have confirmed these results in Python using an input list of strings and the re module.
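A minimal sketch of that kind of check is below. The sample strings are hypothetical stand-ins, not my actual input list, and the names PATH_RE, samples and results are chosen only for this example.

```python
import re

# The pattern from the Attempt section, unchanged.
PATH_RE = re.compile(r"(?:[A-Z]:|\\|(?:\.{1,2}[\/\\])+)[\w+\\\s_\(\)\/]+(?:\.\w+)*")

# Hypothetical sample inputs covering the cases in the requirements.
samples = [
    r"C:\Users\foo\bar.txt",           # Windows absolute path
    "../foo/bar.txt",                  # relative path
    "/foo/bar.txt",                    # UNIX absolute path
    "https://example.com/index.html",  # URL that should be excluded
]

# Collect every match per input string.
results = {s: PATH_RE.findall(s) for s in samples}

for s, found in results.items():
    print(f"{s!r} -> {found}")
```

With these samples, the Windows and relative paths are matched in full, while both the URL and the single-slash UNIX path produce no match.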
Expected
This regex satisfies most of my requirements - namely excluding urls while extracting most file paths. However, I would like to match all paths (including UNIX-style paths that begin with a single slash, e.g. /foo/bar.txt) without matching urls.
Research
I have not found a general solution. Most work tends to satisfy specific cases.
SO Posts
- How to write a regex to match multiple file path
- Regex for extracting filename from path
- regex for finding file paths
- Python regular expression for Windows file path
External sites