6

Good morning guys

Is there a good way to use regular expressions in C# to find all file names and their paths within a string variable?

For example, if you have this string:

string s = @"Hello John

these are the files you have to send us today: <file>C:\Development\Projects 2010\Accounting\file20101130.csv</file>, <file>C:\Development\Projects 2010\Accounting\orders20101130.docx</file>

also we would like you to send <file>C:\Development\Projects 2010\Accounting\customersupdated.xls</file>

thank you";

The result would be:

C:\Development\Projects 2010\Accounting\file20101130.csv
C:\Development\Projects 2010\Accounting\orders20101130.docx
C:\Development\Projects 2010\Accounting\customersupdated.xls

EDITED: Following @Jim's comments, I edited the string to add <file> tags, which makes it easier to extract the file names from the string.

Junior Mayhé
  • What are your results so far? –  Sep 25 '10 at 10:26
  • Should files exists locally or be just well-formed file paths? – abatishchev Sep 25 '10 at 10:27
  • How would you differentiate between a file named **file20101130.csv** and a file named **file20101130.csv, C**? Both whitespace and commas are allowed in file name extensions, so no luck there - you'd have to come up with some constraints on filenames for that to work, i.e. disallow spaces, limit the length of extensions etc. – Jim Brissom Sep 25 '10 at 10:28
  • @Jim, if you mean to add some sort of special characters like "filename" quotes or filename, yes.. I agree with your point – Junior Mayhé Sep 25 '10 at 10:35
  • @abatishchev it is not necessary to verify if files exist locally – Junior Mayhé Sep 25 '10 at 10:35
  • No, my point is: **20100101KCLIENT.data set** is a perfectly valid filename. There is no way you can extract this with pure regex if you allow all valid filename extensions that the file system supports. – Jim Brissom Sep 25 '10 at 10:40
  • OK @Jim. We know how difficult customers are! LOL. So filenames would come with different naming formats, not always 20100101XXXX.xls but also "hello mama.xls". – Junior Mayhé Sep 25 '10 at 10:47

3 Answers

6

Here's something I came up with:

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main()
    {
        string s = @"Hello John these are the files you have to send us today: 
            C:\projects\orders20101130.docx also we would like you to send 
            C:\some\file.txt, C:\someother.file and d:\some file\with spaces.ext  

            Thank you";

        Extract(s);

    }

    private static readonly Regex rx = new Regex
        (@"[a-z]:\\(?:[^\\:]+\\)*((?:[^:\\]+)\.\w+)", RegexOptions.IgnoreCase);

    static void Extract(string text)
    {
        MatchCollection matches = rx.Matches(text);

        foreach (Match match in matches)
        {
            // Print the full path and the captured file name (group 1).
            Console.WriteLine("'{0}', file: '{1}'", match.Value, match.Groups[1].Value);
        }
    }

}

Produces: (see on ideone)

'C:\projects\orders20101130.docx', file: 'orders20101130.docx'
'C:\some\file.txt', file: 'file.txt'
'C:\someother.file', file: 'someother.file'
'd:\some file\with spaces.ext', file: 'with spaces.ext'

The regex is not extremely robust (it does make a few assumptions) but it worked for your examples as well.
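
To make those assumptions easier to see, here is the same pattern written with RegexOptions.IgnorePatternWhitespace and inline comments. This is only an annotated sketch of the regex above, not a more robust one; the class name PathPattern, the method FindPaths and the group name file are made up for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PathPattern
{
    // Same pattern as above, spread over several lines with comments.
    private static readonly Regex Rx = new Regex(@"
        [a-z]:\\              # drive letter, colon, backslash
        (?:[^\\:]+\\)*        # zero or more directories (no backslash or colon in a name)
        (?<file>              # named group: the file name
            [^:\\]+           #   base name, greedy: anything except colon and backslash
            \.\w+             #   extension: a dot plus word characters, so no spaces
        )",
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

    // Returns (full path, file name) pairs for every match in the text.
    public static List<KeyValuePair<string, string>> FindPaths(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Rx.Matches(text))
        {
            result.Add(new KeyValuePair<string, string>(m.Value, m.Groups["file"].Value));
        }
        return result;
    }
}
```

Because the base name is greedy, a match ends at the last dot followed by word characters before the next colon, backslash or end of text. Ordinary text between two paths that happens to contain a dot can therefore end up inside a match.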


Here is a version of the program if you use <file> tags. Change the regex and Extract to:

private static readonly Regex rx = new Regex
    (@"<file>(.+?)</file>", RegexOptions.IgnoreCase);

static void Extract(string text)
{
    MatchCollection matches = rx.Matches(text);

    foreach (Match match in matches)
    {
        Console.WriteLine("'{0}'", match.Groups[1]);
    }
}

Also available on ideone.

Aillyn
  • Your code is really working here. I also have tested, adding extra whitespace in "file 20101130.csv". Thank you Aillyn! – Junior Mayhé Sep 25 '10 at 10:58
  • @Aillyn: Does not deal with Jim Brissom's comment (see comments on op). It also does not take into account that paths can be deeper than just one directory and that the file extensions can contain spaces. – AxelEckenberger Sep 25 '10 at 11:01
  • @Junior I've added a version of the regex that uses `<file>` tags. – Aillyn Sep 25 '10 at 11:01
  • @Obalix True, that is why I said it does make a few assumptions (paths deeper than one directory work fine though, and it wouldn't be hard to add whitespaces to the extensions - not that I've seen files like that). But I agree that using tags would be a better idea – Aillyn Sep 25 '10 at 11:01
  • @Junior Mayhé: The code does work, only under certain circumstances. If you can guarantee that the files will always be in the following format it is ok: `c:\directory\filename.ext`, it does not work for: `c:\directory\directory\filename.ext`, nor for `c:\directory\file name with space.ext with space`, nor for `c:\directory\filename.ext1.ext2`. – AxelEckenberger Sep 25 '10 at 11:04
  • @Obalix, Hi there. I tested Aillyn's code with both cases: C:\Development\Projects2010\Accounting\file 20101130.csv and C:\Development\Projects 2010\Accounting\file 20101130.csv. Notice there is a white space in Projects 2010, it is a subfolder. – Junior Mayhé Sep 25 '10 at 11:13
  • @Aillyn indeed it is cleaner when we use a tag! – Junior Mayhé Sep 25 '10 at 11:14
  • @Junior I've updated my answer with a more robust regex. And now it's also capable of capturing the file name. It still doesn't support extensions with spaces because I have never seen files like that. – Aillyn Sep 25 '10 at 11:20
  • Me too @Aillyn, but the code is OK for searching filenames in a string variable. I am your fan now :-) I didn't pay attention to @Obalix's point about extracting C:\Directory\Sub Directory\That another directory\Those.Namespace.FileName.txt. But your expression now works beautifully – Junior Mayhé Sep 25 '10 at 11:27
  • Don't parse (X)HTML using RegEx! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – abatishchev Sep 27 '10 at 06:29
  • @abatis Read the question carefully. If the OP follows the convention of using a tag only for the files, the result is a regular language, which *can* be parsed by a regular expression. – Aillyn Sep 27 '10 at 15:49
4

If you put some constraints on your filename requirements, you can use code similar to this:

string s = @"Hello John

these are the files you have to send us today: C:\Development\Projects 2010\Accounting\file20101130.csv, C:\Development\Projects 2010\Accounting\orders20101130.docx

also we would like you to send C:\Development\Projects 2010\Accounting\customersupdated.xls

thank you";

Regex regexObj = new Regex(@"\b[a-z]:\\(?:[^<>:""/\\|?*\n\r\0-\37]+\\)*[^<>:""/\\|?*\n\r\0-\37]+\.[a-z0-9\.]{1,5}", RegexOptions.IgnorePatternWhitespace|RegexOptions.IgnoreCase);
MatchCollection fileNameMatchCollection = regexObj.Matches(s);
foreach (Match fileNameMatch in fileNameMatchCollection)
{
    MessageBox.Show(fileNameMatch.Value);
}

In this case, I limited extensions to a length of 1-5 characters. You can obviously use another value or restrict the characters allowed in filename extensions further. The characters excluded from file and directory names are taken from the MSDN article Naming Files, Paths, and Namespaces.
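
As a rough sketch of how "another value" could be plugged in, the extension limit can be turned into a parameter. The class ConstrainedPaths, the method Find and the parameter maxExtensionLength are hypothetical names, and the pattern is only based on the one above, not a tested replacement for it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConstrainedPaths
{
    // One path segment: a run of characters that are not reserved in
    // Windows file names. \x00-\x1F is the same control-character range
    // that the pattern above writes in octal as \0-\37.
    private const string Segment = @"[^<>:""/\\|?*\x00-\x1F]+";

    public static List<string> Find(string text, int maxExtensionLength = 5)
    {
        if (maxExtensionLength < 1)
            throw new ArgumentOutOfRangeException("maxExtensionLength");

        string pattern =
            @"\b[a-z]:\\"                 // drive letter, colon, backslash
            + "(?:" + Segment + @"\\)*"   // zero or more directories
            + Segment                     // base name
            + @"\.[a-z0-9.]{1," + maxExtensionLength + "}"; // dot + extension

        var paths = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern, RegexOptions.IgnoreCase))
            paths.Add(m.Value);
        return paths;
    }
}
```

Note that the limit does not reject longer extensions. With a limit of 3, a path ending in .docx would still match, but the match would stop after .doc, so the value should be at least as long as the longest extension you expect.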

Jim Brissom
-1

If you use the <file> tag and the final text can be treated as a well-formed XML document (as inner XML, i.e. text without a root element), you can probably do:

var doc = new XmlDocument();
doc.LoadXml(String.Concat("<root>", input, "</root>"));

var files = doc.SelectNodes("//file");

or

var doc = new XmlDocument();

doc.AppendChild(doc.CreateElement("root"));
doc.DocumentElement.InnerXml = input;

var nodes = doc.SelectNodes("//file");

Both methods work and are highly object-oriented, especially the second one.

They should also perform better than the regex.

See also - Don't parse (X)HTML using RegEx
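
For completeness, a minimal sketch of the first variant that also reads the paths out of the selected nodes. The class TaggedFiles and the method Extract are made-up names, and the sketch assumes the text contains nothing that breaks XML well-formedness apart from the missing root element.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class TaggedFiles
{
    public static List<string> Extract(string input)
    {
        var doc = new XmlDocument();
        // Wrap the text in a made-up root element so it parses as one document.
        // LoadXml throws XmlException if the text is not well-formed,
        // for example if it contains a bare '&' or '<'.
        doc.LoadXml("<root>" + input + "</root>");

        var paths = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("//file"))
            paths.Add(node.InnerText); // the text between <file> and </file>
        return paths;
    }
}
```

Backslashes and spaces in the paths need no escaping in XML, so the sample string from the question can be passed in as it is.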

abatishchev
  • @Aillyn: No, it is NOT. Parsing well formed XML with RegEx - is much, much worse – abatishchev Sep 27 '10 at 17:51
  • It happens that the OP is using a subset of XML (if you call it that) that *is* regular, thus, it *can* be parsed with RegEx. There is absolutely no need for an XML parser. – Aillyn Sep 27 '10 at 22:06