I've been trying to come up with a regular expression that extracts all valid Unix paths from a given text but does not match any URL (such as http://...).

The following paths are all valid:

/home/username/some_file.txt
/home/username/some_file.longext
"/path/to/file/some file.longext"

But it should not match any of these:

http://www.somelink.com
ftp://www.somelink.co.uk
https://www.somelink.com and so on

I came up with this, but it matches all URLs too, which is something I'm trying to filter out:

"?[a-zA-Z0-9\/].*\.[a-zA-Z0-9].*"?

EDIT: I should mention that the input text is the contents of a file that contains URLs as well as valid Unix paths, so the regex needs to match a path anywhere inside the text without matching URLs.
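
To see why the attempt above also matches URLs, it can help to run it against a couple of sample strings. The sketch below assumes Python's `re` module purely for illustration (the question does not say which regex engine is in use); the sample strings are built from the examples above.

```python
import re

# The pattern from the question, unchanged.
ATTEMPT = re.compile(r'"?[a-zA-Z0-9\/].*\.[a-zA-Z0-9].*"?')

# A URL: the first character class accepts the letter "h", and the greedy
# ".*" then consumes "ttp://www.somelink" before the literal dot.
url_match = ATTEMPT.search("http://www.somelink.com")

# A path inside a sentence: the match starts at the first letter of the
# line, not at the slash, so the surrounding words are swallowed too.
line_match = ATTEMPT.search("see /home/username/some_file.txt now")

print(url_match.group(0))
print(line_match.group(0))
```

Nothing in the pattern requires the match to start with `/`, and `.*` matches any run of characters, so any text containing a dot followed by a letter or digit qualifies.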

strange
  • `[a-zA-Z0-9]` isn't enough to match UNIX paths - the only characters you can't use in a POSIX path are `/` and the null character. – Carl Norum Aug 12 '12 at 18:14
  • an answer here: http://stackoverflow.com/questions/537772/what-is-the-most-correct-regular-expression-for-a-unix-file-path then what you don't want to find is here: http://tools.ietf.org/html/rfc3986#appendix-B so you need to check the first thing, then discard url. – N4553R Aug 12 '12 at 18:15
  • I did have a look at that question but the accepted regex did not work at all for the given text I have to filter on – strange Aug 12 '12 at 18:25
  • What about a `file:///path/to/file` URL? – Jonathan Leffler Aug 12 '12 at 19:35
  • yes that's fine because it does not begin with http/https/ftp etc – strange Aug 12 '12 at 21:40

2 Answers

You should be aware that any solution you come up with will only be a heuristic.

cd /tmp
mkdir test
cd test
mkdir http:
cd http:
mkdir www.google.com
cd www.google.com
echo "I'm a file, not a web site" > 'search?q=Unix+path+syntax+double+slash'
cd /tmp/test

And now http://www.google.com/search?q=Unix+path+syntax+double+slash is both a URL and a path to a file:

cat 'http://www.google.com/search?q=Unix+path+syntax+double+slash'
w3m 'http://www.google.com/search?q=Unix+path+syntax+double+slash'

The only solid way to know what's a pathname and what isn't a pathname is through context. An argument to cat is a pathname. An argument to w3m isn't. In free-form text, without parsing the writer's native language, you're guessing.

Alan Curry
  • Actually an argument to `w3m` is a pathname, sometimes. It does some guessing itself. I used `wget` instead of `w3m` in the original answer, but Google bans `wget` so I changed it in a hurry, and ended up with this mess. – Alan Curry Aug 12 '12 at 19:42
  • That's all fine, guys. I'm okay with a heuristic, since I know the files will be log files and they cannot possibly refer to path names that are actually fake URLs. I just need the regex to distinguish between the two. In short, I need a regex that filters out any path that begins with http/ftp/https etc. – strange Aug 12 '12 at 21:39
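
For log files like the ones described in the comment above, one possible heuristic is to accept a `/` only when it starts a whitespace-delimited token or directly follows an opening double quote. The following is a minimal sketch, assuming Python's `re` module and absolute paths only; `PATH_RE` and `extract_paths` are hypothetical names, and the sample text is made up from the examples in the question.

```python
import re

# Two alternatives; exactly one of the two groups is set per match.
PATH_RE = re.compile(
    r'''
      "(/[^"\n]*)"        # quoted absolute path; spaces allowed inside the quotes
    | (?<!\S)(/[^\s"]+)   # unquoted absolute path that starts its own token
    ''',
    re.VERBOSE,
)


def extract_paths(text):
    """Return absolute Unix paths found in text, skipping scheme://host URLs."""
    return [m.group(1) or m.group(2) for m in PATH_RE.finditer(text)]


# Made-up log excerpt mixing paths and URLs.
sample = (
    'opened /home/username/some_file.txt\n'
    'fetched http://www.somelink.com and ftp://www.somelink.co.uk\n'
    'copied "/path/to/file/some file.longext" to /tmp\n'
    'see https://www.somelink.com/docs/index.html\n'
)

print(extract_paths(sample))
```

The slashes inside a URL are always preceded by a non-space character (`:`, `/`, or part of the host name), so the lookbehind rejects them. The sketch remains a heuristic: an unquoted path containing spaces is cut at the first space, trailing punctuation is kept as part of the path, and relative paths are not matched at all.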

It seems as simple as matching a slash at the beginning of the string, assuming that your paths are absolute and that there is no need to check whether the path exists, is readable, or similar. The regex should begin with `^"?/`. That will be enough to filter out URLs.
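
As a rough illustration of this prefix check, the sketch below applies it to candidates that have already been isolated, for example one per line. It assumes Python's `re` module; `looks_like_absolute_path` is a hypothetical helper, not part of any library.

```python
import re

# Prefix from the answer: an optional opening quote, then a slash.
PREFIX = re.compile(r'^"?/')


def looks_like_absolute_path(candidate):
    """True if the candidate starts like an absolute path, optionally quoted."""
    return PREFIX.match(candidate) is not None


candidates = [
    '/home/username/some_file.txt',
    '"/path/to/file/some file.longext"',
    'http://www.somelink.com',
    'ftp://www.somelink.co.uk',
]
for candidate in candidates:
    print(candidate, looks_like_absolute_path(candidate))
```

Because `^` anchors at the start of the string, this only classifies whole candidates. It does not by itself locate a path in the middle of a line, and the rest of the expression still has to decide where the path ends.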

Birei
  • But that does not match spaces within filenames and does not match on this for example: /Users/Me/Desktop/Path/SomeMore/Screen shot 2011-03-15 at 20.38.21.png – strange Aug 12 '12 at 18:24
  • It's the beginning of the regular expression. – Birei Aug 12 '12 at 18:29