
First, I have spent three hours trying to solve this. Also, please don't suggest avoiding regex. I appreciate other comments and could easily use other methods, but I am practicing regex as much as possible.

I am using VB.Net

Example string:

"Hello world this is a string C:\Example\Test E:\AnotherExample"

Pattern:

"[A-Z]{1}:.+?[^ ]*"

This works fine. However, what if the directory name contains whitespace? I want to match every substring that starts with one uppercase letter followed by a colon and then anything else. Each match should run up to a whitespace character followed by one uppercase letter and a colon, and then the same sequence should be matched again.

I hope this makes sense.
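To make the problem concrete, here is a minimal sketch of how the pattern above behaves. It is written in Python rather than VB.Net only so it is easy to run; the pattern itself is unchanged, and the second sample string (with spaces in the directory names) is a made-up example, not taken from real input.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"[A-Z]{1}:.+?[^ ]*")


def find_paths_original(text):
    """Return every substring matched by the original pattern."""
    return ORIGINAL.findall(text)


# The example string from the question (no spaces inside the paths).
no_spaces = r"Hello world this is a string C:\Example\Test E:\AnotherExample"

# Hypothetical input where the directory names contain spaces.
with_spaces = r"Hello world C:\Program Files\Test E:\Another Example"

print(find_paths_original(no_spaces))
print(find_paths_original(with_spaces))
```

For the first string both paths come back whole. For the second, each match is cut at its first space (C:\Program and E:\Another), because [^ ]* stops at a space.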

Ahmed KRAIEM
  • Should it be able to handle something like, "This is a string C:\program files\test D:\test and this is another string"? Because any strings at the end would be - as far as I can tell - impossible to tell from a directory with spaces. – Gray Jun 21 '13 at 13:05
  • You ask the impossible. Assuming these paths relate to the local file system, you'd need to test successively longer candidates to ensure that they are directories... otherwise there's no way to resolve the ambiguity of successive words that do or do not form part of a path. – spender Jun 21 '13 at 13:17

2 Answers


How about "[A-Z]{1}:((?![A-Z]{1}:).)*", which should stop before the next drive letter and colon?

That "(?!...)" is a negative lookahead (a zero-width assertion, one kind of "lookaround") which, according to the question "Regular expression to match a line that doesn't contain a word?", is the way to get around the lack of inverse matching in regexes.
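A minimal sketch of that idea, assuming Python's re module (the lookahead syntax is the same in .NET as far as I know). Two things here are my own additions rather than part of the pattern above: the group is made non-capturing so the whole match is returned, and trailing whitespace is stripped because the match runs right up to the next drive letter. The function name find_paths is hypothetical.

```python
import re

# Drive letter and colon, then any character as long as the next two
# characters are not another "drive letter + colon".
TEMPERED = re.compile(r"[A-Z]:(?:(?![A-Z]:).)*")


def find_paths(text):
    """Return each match with the trailing whitespace removed."""
    return [m.group(0).rstrip() for m in TEMPERED.finditer(text)]


# Hypothetical input with spaces inside the directory names.
print(find_paths(r"Hello world C:\Program Files\Test E:\Another Example"))

# The input from Gray's comment: trailing words are absorbed into the last path.
print(find_paths(
    r"This is a string C:\program files\test D:\test and this is another string"
))
```

The second call shows the ambiguity raised in the comments: everything after D:\test is treated as part of that path, since nothing in the text marks where the path ends.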

Jon
  • ... actually, Roy Osherove's Regulator (http://www.webresourcesdepot.com/learn-test-regular-expressions-with-the-regulator/) is telling me that the above will match the drive letters but nothing else besides, which I don't quite understand at this point. – Jon Jun 21 '13 at 13:27
  • ...and of course, @Gray is correct that any successive words would be assumed to be a part of the last path. Paths containing spaces are often quote delimited for just that reason - a space in a file path is indistinguishable from a space after a file path :-) – Jon Jun 21 '13 at 13:39

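Not to be too picky, but most filesystems disallow a small number of characters in file names (like <>/\:?"), so a more accurate pattern for a file path would be something like [A-Z]:\\((?![A-Z]{1}:)[^<>/:?"])*. The backslash is left out of the excluded set because it still has to match as the path separator.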
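As a rough sketch of how this stricter pattern behaves, assuming Python's re module and a made-up input string: the group is written as non-capturing and trailing whitespace is stripped, both of which are my additions, and find_paths_strict is a hypothetical name.

```python
import re

# Drive letter, colon, backslash, then any run of characters that are not
# in the excluded set and do not start another "drive letter + colon".
STRICT = re.compile(r'[A-Z]:\\(?:(?![A-Z]:)[^<>/:?"])*')


def find_paths_strict(text):
    """Return each match with the trailing whitespace removed."""
    return [m.group(0).rstrip() for m in STRICT.finditer(text)]


# Hypothetical input: the second path is followed by a quoted word.
sample = 'see C:\\Program Files\\Test E:\\Another "quoted"'
print(find_paths_strict(sample))
```

Here the second match stops at the double quote, so the quoted word is not absorbed into the path. The lookahead is still needed: without it the character class would consume the "E" of the next drive letter before stopping at the colon. One difference from the dot-based pattern is that a negated character class also matches newlines.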

The other important point that has been raised is how you expect to parse input like "hello path is c:\folder\file.extension this is not part of the path:P"? This is a problem you commonly run into when you start trying to parse without specifying the allowed range of inputs, or the grammar that a parser accepts. This particular problem seems pretty ad hoc and so I don't really expect you to come up with a grammar or to define how particular messages are encoded. But the next time you approach a parsing problem, see if you can first define what messages are allowed and what they mean (syntax and semantics). I think you'll find that once you've defined the structure of allowed messages, parsing can be almost trivial.

bmm6o