Parsing a string to extract a URL or folder path

Question

I asked a similar question recently about using regex to retrieve a URL or folder path from a string. I was looking at this comment by Dour High Arch, where he says:

"I recommend you do not use regexes at all; use separate code paths for URLs, using the Uri class, and file paths, using the FileInfo class. These classes already handle parsing, matching, extracting components, and so on."

I never really tried this, but now I am looking into it and can't figure out whether what he said is actually useful for what I'm trying to accomplish.

I want to be able to parse a string message that could be something like:

"I placed the files on the server at http://www.thewebsite.com/NewStuff, they can also be reached on your local network drives at J:\Downloads\NewStuff"

And extract the two strings http://www.thewebsite.com/NewStuff and J:\Downloads\NewStuff. I don't see any methods on the Uri or FileInfo class that parse a Uri or FileInfo object out of a larger string, as I think Dour High Arch was implying.

Is there something I'm missing about using the Uri or FileInfo class that will allow this behavior? If not, is there some other class in the framework that does this?

I think what that comment was implying is that if you pass it a file path, etc., it should swap backslashes for forward slashes and so forth. I myself would use Regex. — Squirrel5853, Oct 07 '13 at 16:39
You were missing Uri.IsWellFormedUriString, which checks whether a string is a well-formed URI of the given kind (including file URIs and URLs). This is the matching mentioned. See http://msdn.microsoft.com/en-us/library/system.uri.iswellformeduristring.aspx — Hogan, Oct 07 '13 at 16:49
@Hogan If I'm only passing in a string that is a Uri that would be fine. However, I'm asking if there is a method of the Uri or FileInfo class that can accept a string such as the one in the example, and retrieve a URI or Filepath from that string without any further work... — Zack, Oct 07 '13 at 16:52

CSharpie · Answer 1 · 2013-10-07T16:49:25.173

1

I'd say the easiest way is to split the string into parts first.

The first delimiter would be spaces, to get each word; the second would be quotes (double and single).

Then use Uri.IsWellFormedUriString on each token.

So something like:

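foreach (var part in someRandomText.Split(new[] { '\'', '"', ' ' }, StringSplitOptions.RemoveEmptyEntries))
{
    // UriKind.Absolute: with RelativeOrAbsolute, ordinary words would generally pass as relative URIs too
    if (Uri.IsWellFormedUriString(part, UriKind.Absolute))
        doSomethingWith(part);
}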

I just saw that Uri.IsWellFormedUriString may be a bit too strict for your needs: it returns false if www.Whatever.com is missing the http://.
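A minimal sketch of the split-and-check idea, with two changes that are assumptions of this sketch rather than part of the answer above: tokens are trimmed of surrounding punctuation, and Uri.TryCreate is swapped in for Uri.IsWellFormedUriString because, as far as I know, TryCreate also accepts a drive-letter path such as J:\Downloads\NewStuff as an implicit file URI. TokenScan and FindLocations are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class TokenScan
{
    // Punctuation stripped from both ends of each token (an assumption; adjust to your input).
    private static readonly char[] Punctuation =
        { ',', '.', ';', ':', '!', '?', '"', '\'', '(', ')' };

    // Returns every whitespace-delimited token that parses as an http, https, or file URI.
    public static List<Uri> FindLocations(string message)
    {
        var results = new List<Uri>();
        var tokens = message.Split(
            new[] { ' ', '\t', '\r', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        foreach (var raw in tokens)
        {
            var token = raw.Trim(Punctuation);
            if (token.Length == 0)
                continue;

            // Plain words are not absolute URIs, so they are rejected here.
            Uri uri;
            if (!Uri.TryCreate(token, UriKind.Absolute, out uri))
                continue;

            // Keep only the kinds of location the question asks about.
            if (uri.IsFile || uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
                results.Add(uri);
        }
        return results;
    }
}
```

Called on the example message from the question, this should return one http URI and one file URI. A path that contains spaces would still be split into separate tokens, so this only helps when paths have no whitespace.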

edited Oct 07 '13 at 16:49

answered Oct 07 '13 at 16:44


I guess you have to trim other punctuation such as commas, periods, exclamation and question marks, colons, semicolons and so on as well, right? – Jerry Oct 07 '13 at 16:45

Yeah, maybe it would be better to use Regex.Matches instead of string.Split. So define a looser regex for what could be a path, then use Uri.IsWellFormedUriString to make sure it is one. – CSharpie Oct 07 '13 at 16:46

Sedecimdies · Answer 2 · 2013-10-08T02:58:56.670

You can use:

(?<type>[^ ]+?:)(?<path>//[^ ]*|\\.+\\[^ ]*)

That will give you two groups for each result:

type : "http:"

path : //www.thewebsite.com/NewStuff, (the trailing comma is captured too, because [^ ]* stops only at a space)

and

type : "J:"

path : \Downloads\NewStuff

out of the string

"I placed the files on the server at http://www.thewebsite.com/NewStuff, they can also be reached on your local network drives at J:\Downloads\NewStuff"

You can use the "type" group to check whether the type is http: or not and act on that.

EDIT

Or use the regex below if you are sure there is no whitespace in your file path:

(?<type>[^ ]+?:)(?<path>//[^ ]*|\\[^ ]*)
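For reference, here is a rough sketch of how the named groups could be read from C#. It uses the second pattern (the one that assumes no whitespace in the path). GroupedMatch and Extract are hypothetical names, and trimming trailing punctuation from the path is an addition of this sketch, needed because [^ ]* stops only at a space and so picks up the comma after the URL.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupedMatch
{
    // The "no whitespace in the path" pattern, with its two named groups.
    private static readonly Regex Pattern =
        new Regex(@"(?<type>[^ ]+?:)(?<path>//[^ ]*|\\[^ ]*)");

    // Returns one (type, path) pair per match, in order of appearance.
    public static List<(string Type, string Path)> Extract(string message)
    {
        var results = new List<(string Type, string Path)>();
        foreach (Match m in Pattern.Matches(message))
        {
            string type = m.Groups["type"].Value;
            // Drop sentence punctuation that the pattern swallowed (an assumption about the input).
            string path = m.Groups["path"].Value.TrimEnd(',', '.', ';');
            results.Add((type, path));
        }
        return results;
    }
}
```

Branching on the result is then a matter of checking whether Type equals "http:" or is a drive letter followed by a colon.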

score 1 · Answer 3 · edited May 23 '17 at 10:25

It was not clear from your earlier question that you wanted to extract URL and file path substrings from larger strings. In that case, neither Uri.IsWellFormedUriString nor Regex.Match will do what you want. Indeed, I do not think any simple method can do what you want because you will have to define rules for ambiguous strings like httX://wasThatAUriScheme/andAre/these part/of/aURL or/are they/separate.strings?andIsThis%20a%20Param?

My suggestion is to define a recursive descent parser and create states for each substring you need to distinguish.
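To give a feel for that approach, here is a small hand-written scanner in the descent style, with one function per grammar rule. The grammar is entirely an assumption made for this sketch and deliberately sidesteps the ambiguous cases above: a URL is a scheme, then "://", then any run of non-whitespace; a drive path is a letter, a colon, and one or more backslash-separated segments without spaces. LocationScanner and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class LocationScanner
{
    // Walks the text and collects every substring that matches the assumed grammar.
    public static List<string> Scan(string text)
    {
        var found = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            // Only try to parse at the start of a word.
            bool atWordStart = i == 0 || !char.IsLetterOrDigit(text[i - 1]);
            int end = atWordStart ? ParseLocation(text, i) : -1;
            if (end > i) { found.Add(text.Substring(i, end - i)); i = end; }
            else { i++; }
        }
        return found;
    }

    // location := url | drivePath. Each rule returns the end index, or -1 on failure.
    private static int ParseLocation(string s, int i)
    {
        int end = ParseUrl(s, i);
        return end >= 0 ? end : ParseDrivePath(s, i);
    }

    // url := scheme "://" nonSpace+   (trailing sentence punctuation is dropped)
    private static int ParseUrl(string s, int i)
    {
        int p = ParseScheme(s, i);
        if (p < 0 || p + 3 > s.Length || s[p] != ':' || s[p + 1] != '/' || s[p + 2] != '/')
            return -1;
        p += 3;
        int start = p;
        while (p < s.Length && !char.IsWhiteSpace(s[p])) p++;
        while (p > start && ",.;:!?\"')".IndexOf(s[p - 1]) >= 0) p--;
        return p > start ? p : -1;
    }

    // scheme := letter (letter | digit | "+" | "-" | ".")*
    private static int ParseScheme(string s, int i)
    {
        if (i >= s.Length || !IsAsciiLetter(s[i])) return -1;
        int p = i + 1;
        while (p < s.Length && (IsAsciiLetter(s[p]) || char.IsDigit(s[p])
               || s[p] == '+' || s[p] == '-' || s[p] == '.')) p++;
        return p;
    }

    // drivePath := letter ":" ("\" segment)+
    private static int ParseDrivePath(string s, int i)
    {
        if (i + 1 >= s.Length || !IsAsciiLetter(s[i]) || s[i + 1] != ':') return -1;
        int p = i + 2;
        int segments = 0;
        while (p < s.Length && s[p] == '\\')
        {
            int q = ParseSegment(s, p + 1);
            if (q < 0) break;
            p = q;
            segments++;
        }
        return segments > 0 ? p : -1;
    }

    // segment := (letter | digit | "_" | "-" | ".")+
    private static int ParseSegment(string s, int i)
    {
        int p = i;
        while (p < s.Length && (char.IsLetterOrDigit(s[p]) || s[p] == '_'
               || s[p] == '-' || s[p] == '.')) p++;
        return p > i ? p : -1;
    }

    private static bool IsAsciiLetter(char c)
    {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }
}
```

Each ambiguity you decide on (spaces in paths, UNC shares, URLs without a scheme) becomes a new or changed rule function, which is the main advantage over one large regex.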

score -1 · Answer 4 · answered Oct 08 '13 at 17:19

Try \w+:\S+ and see how well that fits your purposes.
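A quick way to try it, as a sketch only (SimplePattern and Candidates are hypothetical names). It collects the raw matches so you can see both what the pattern finds and what it over-matches.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SimplePattern
{
    // Returns every raw match of \w+:\S+ in the message, untrimmed.
    public static List<string> Candidates(string message)
    {
        var found = new List<string>();
        foreach (Match m in Regex.Matches(message, @"\w+:\S+"))
            found.Add(m.Value);
        return found;
    }
}
```

On the example message this yields the URL (with its trailing comma still attached) and the J: path. It also matches things that are not locations, such as the 12:30 in "Lunch at 12:30 today", so some trimming and validation of the candidates would still be needed.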

Michael Dyck

