I'm using a list of regexes to parse FTP server directory listings. I'm not good with regex at all; this is a list of patterns I collected online to parse the output of various FTP servers:

private static readonly string[] DirectoryParseFormats =
{
    // Unix-style (ls -l) listings
    "(?<dir>[\\-d])(?<permission>([\\-r][\\-w][\\-xs]){3})\\s+\\d+\\s+\\w+\\s+\\w+\\s+(?<size>\\d+)\\s+(?<timestamp>\\w+\\s+\\d+\\s+\\d{4})\\s+(?<name>.+)",
    "(?<dir>[\\-d])(?<permission>([\\-r][\\-w][\\-xs]){3})\\s+\\d+\\s+\\d+\\s+(?<size>\\d+)\\s+(?<timestamp>\\w+\\s+\\d+\\s+\\d{4})\\s+(?<name>.+)",
    "(?<dir>[\\-d])(?<permission>([\\-r][\\-w][\\-xs]){3})\\s+\\d+\\s+\\d+\\s+(?<size>\\d+)\\s+(?<timestamp>\\w+\\s+\\d+\\s+\\d{1,2}:\\d{2})\\s+(?<name>.+)",
    "(?<dir>[\\-d])(?<permission>([\\-r][\\-w][\\-xs]){3})\\s+\\d+\\s+\\w+\\s+\\w+\\s+(?<size>\\d+)\\s+(?<timestamp>\\w+\\s+\\d+\\s+\\d{1,2}:\\d{2})\\s+(?<name>.+)",
    "(?<dir>[\\-d])(?<permission>([\\-r][\\-w][\\-xs]){3})(\\s+)(?<size>(\\d+))(\\s+)(?<ctbit>(\\w+\\s\\w+))(\\s+)(?<size2>(\\d+))\\s+(?<timestamp>\\w+\\s+\\d+\\s+\\d{2}:\\d{2})\\s+(?<name>.+)",
    // MS-DOS-style listings (AM/PM timestamp, optional <DIR> marker)
    "(?<timestamp>\\d{2}\\-\\d{2}\\-\\d{2}\\s+\\d{2}:\\d{2}[AaPp][mM])\\s+(?<dir>\\<\\w+\\>){0,1}(?<size>\\d+){0,1}\\s+(?<name>.+)"
};

Now I've stumbled upon the following output from an odd FTP server. What's weird is that the server returns the file name together with the folder name for some reason.

Anyway, I'd like a similar regex for this string, ideally one that captures the folder name in a separate group. The string returned by the server is what's between the pipes |

|-rw-rw-rw- 1 generic 235 Mar 22 11:21 fromDoder/DOD997ABCD.20170322112114159.1961812284.txt|

EDIT:

Here is the C# code I use to iterate through the regex patterns and pick the first one that matches the FTP server output. I then use that match to extract the file name and type.

// Use our regex library to parse
match = DirectoryParseFormats.Select(dpf => new Regex(dpf).Match(raw)).FirstOrDefault(m => m.Success); 

if (match == null) throw new Exception($"Can't parse FTP directory list item. raw item: |{raw}|, whole response: |{response}|");

// If not directory - this is file
var dir = match.Groups["dir"].Value;
if (dir == string.Empty || dir == "-") list.Add(match.Groups["name"].Value);

EDIT 2:

total 0
drw-rw-rw-   1 user     group           0 Apr 23  2016 .
drw-rw-rw-   1 user     group           0 Apr 23  2016 ..

EDIT 3:

var hintRegex = @"^
(?<dir>[-d])
(?<permission>(?:[-r][-w][-xs]){3})
\s+\d+
\s+\w+
(?:\s+\w+)?
\s+(?<size>\d+)
\s+(?<timestamp>\w+\s+\d+(?:\s+\d+(?::\d+)?))
\s+(?!(?:\.|\.\.)\s*$)(?<name>.+?)\s*
$";

            Match match = new Regex(hintRegex).Match("-rw-r--r-- 1 ftp ftp           1079 Apr 06  2017 LEANCOR_040617084839.txt");

            if (!match.Success) Debug.WriteLine("Doesn't match");
katit
  • Are you trying to match `ls -l` output? – vallentin Mar 22 '17 at 16:27
  • No, I use C# `Match` method, see edited question. Ok, I read again - I don't know what I'm trying to match, this is their server output. Probably `ls -l` but I don't know – katit Mar 22 '17 at 16:31
  • That's not what I'm asking. Your regex, it looks like you're trying to match the output of `ls -l`. So are you? Could you give an example of what you're trying to match and what should match and what shouldn't. – vallentin Mar 22 '17 at 16:32
  • @Vallentin I don't know if it's `ls -l` or what. It's 3rd party FTP server responses to LIST command. Example I provided DOES NOT match any of my regexes (most servers do, not this one). I need regex in similar manner that will match this output – katit Mar 22 '17 at 16:34

2 Answers


The regex for the given input string is as follows:

(?<permission>([\\-rwxs]+){3})\\s+\\d+\\s+\\w+\\s+(?<size>\\d+)\\s+(?<timestamp>\\w+\\s+\\d+\\s+\\d{1,2}:\\d{1,2})\\s+(?<folder>\\w+\\/)?(?<name>.+)

The online regex test including regex pattern and the given input string is shown in the image below.
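
As a rough sketch of how the `folder` and `name` groups could be read in C#, the snippet below applies the pattern above to the sample line from the question. The class and method names (`FolderGroupDemo`, `Split`) are hypothetical. Note that this pattern expects a single owner column and an `HH:MM` timestamp, so it is not a general replacement for the other formats.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FolderGroupDemo
{
    // The pattern from this answer as a verbatim string, so backslashes are not doubled.
    const string Pattern =
        @"(?<permission>([\-rwxs]+){3})\s+\d+\s+\w+\s+(?<size>\d+)\s+" +
        @"(?<timestamp>\w+\s+\d+\s+\d{1,2}:\d{1,2})\s+(?<folder>\w+\/)?(?<name>.+)";

    // Hypothetical helper: returns (folder, name).
    // Folder is empty when the listing line has no "folder/" prefix.
    public static (string Folder, string Name) Split(string raw)
    {
        Match m = Regex.Match(raw, Pattern);
        if (!m.Success) throw new FormatException($"Unrecognized listing line: |{raw}|");
        return (m.Groups["folder"].Value, m.Groups["name"].Value);
    }

    public static void Main()
    {
        var (folder, name) = Split(
            "-rw-rw-rw- 1 generic 235 Mar 22 11:21 fromDoder/DOD997ABCD.20170322112114159.1961812284.txt");
        Console.WriteLine(folder); // fromDoder/
        Console.WriteLine(name);   // DOD997ABCD.20170322112114159.1961812284.txt
    }
}
```

The `folder` group keeps its trailing slash, so trim it if only the bare folder name is needed.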


Rupesh

Your pattern looks like it's meant to match the output of ls -l, and you mention it's the response to a LIST command, so I'm assuming that's what it is.

The main problem I could gather from your code is that you're missing the multiline flag (RegexOptions.Multiline).

Your regex overall seems correct; I only made a few changes. Here it is laid out with a bit of spacing (it still works in this form if you use the extended flag, which in C# is RegexOptions.IgnorePatternWhitespace).

^
(?<dir>[-d])
(?<permission>(?:[-r][-w][-xs]){3})
\s+\d+
\s+\w+
(?:\s+\w+)?
\s+(?<size>\d+)
\s+(?<timestamp>\w+\s+\d+(?:\s+\d+(?::\d+)?))
\s+(?!(?:\.|\.\.)\s*$)(?<name>.+?)\s*
$
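
If you want to keep the pattern in this laid-out form in C#, the line breaks have to be ignored by the regex engine. The following is a minimal sketch, assuming the pattern is stored in a verbatim string and compiled with `RegexOptions.IgnorePatternWhitespace`; the class and field names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtendedFlagDemo
{
    // The laid-out pattern, line breaks kept for readability.
    const string Pattern = @"^
(?<dir>[-d])
(?<permission>(?:[-r][-w][-xs]){3})
\s+\d+
\s+\w+
(?:\s+\w+)?
\s+(?<size>\d+)
\s+(?<timestamp>\w+\s+\d+(?:\s+\d+(?::\d+)?))
\s+(?!(?:\.|\.\.)\s*$)(?<name>.+?)\s*
$";

    // IgnorePatternWhitespace makes the engine skip the unescaped line breaks in Pattern.
    // Without it, each line break is treated as a literal character that must appear in the input.
    public static readonly Regex Line =
        new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        Match m = Line.Match(
            "-rw-r--r-- 1 ftp ftp           1079 Apr 06  2017 LEANCOR_040617084839.txt");
        Console.WriteLine(m.Success);              // True
        Console.WriteLine(m.Groups["name"].Value); // LEANCOR_040617084839.txt
    }
}
```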

Here's a live preview.

You can test it by doing:

string pattern = @"^(?<dir>[-d])(?<permission>(?:[-r][-w][-xs]){3})\s+\d+\s+\w+(?:\s+\w+)?\s+(?<size>\d+)\s+(?<timestamp>\w+\s+\d+(?:\s+\d+(?::\d+)?))\s+(?!(?:\.|\.\.)\s*$)(?<name>.+?)\s*$";
Regex re = new Regex(pattern, RegexOptions.Multiline);

string source = @"
-rwxr-xr-x 1 root  46789 Feb  7 23:15 certbot-auto
drwxr-xr-x 2 root   4096 Mar 22 16:29 test dir
drwxr-xr-x 4 root   4096 Feb 10 15:50 www
-rw-rw-rw- 1 generic 235 Mar 22 11:21 fromDoder/DOD997ABCD.20170322112114159.1961812284.txt
-rw-rw-rw- 1 cmuser cmuser 904 Mar 23 15:04 20170323110427785_3741647.edi
drw-rw-rw- 1 user   group    0 Apr 23  2016 .
drw-rw-rw- 1 user   group    0 Apr 23  2016 ..
drw-rw-rw- 1 user   group    0 Apr 23  2016 .cache
drw-rw-rw- 1 user   group    0 Apr 23  2016 .bashrc
";

MatchCollection matches = re.Matches(source);

Console.WriteLine(matches.Count);

foreach (Match match in matches)
{
    Console.WriteLine(match.Groups["dir"]);
    Console.WriteLine(match.Groups["permission"]);
    Console.WriteLine(match.Groups["size"]);
    Console.WriteLine(match.Groups["timestamp"]);
    Console.WriteLine(match.Groups["name"]);
    Console.WriteLine();
}

Note that the content of source is just an edited version of the output of ls -l on my server (with the addition of your example). So if my assumptions are correct, it should look familiar to you.
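
If you split the response into lines yourself instead of using RegexOptions.Multiline, one option is to skip lines that don't match (such as a leading `total 0`) rather than throwing. This is only a sketch under that assumption; `PerLineDemo` and `FileNames` are hypothetical names, and the pattern is the single-line one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PerLineDemo
{
    // No Multiline flag needed: each line is matched on its own.
    static readonly Regex ListLine = new Regex(
        @"^(?<dir>[-d])(?<permission>(?:[-r][-w][-xs]){3})\s+\d+\s+\w+(?:\s+\w+)?" +
        @"\s+(?<size>\d+)\s+(?<timestamp>\w+\s+\d+(?:\s+\d+(?::\d+)?))" +
        @"\s+(?!(?:\.|\.\.)\s*$)(?<name>.+?)\s*$");

    // Hypothetical helper: returns the names of regular files in a LIST response.
    public static List<string> FileNames(string response)
    {
        var files = new List<string>();
        var lines = response.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string raw in lines)
        {
            Match m = ListLine.Match(raw);
            if (!m.Success) continue;           // e.g. "total 0", "." and ".."
            if (m.Groups["dir"].Value == "-")   // '-' marks a regular file, 'd' a directory
                files.Add(m.Groups["name"].Value);
        }
        return files;
    }

    public static void Main()
    {
        string response =
            "total 0\r\n" +
            "drw-rw-rw-   1 user     group           0 Apr 23  2016 .\r\n" +
            "-rw-rw-rw- 1 cmuser cmuser 904 Mar 23 15:04 20170323110427785_3741647.edi\r\n";
        foreach (string name in FileNames(response))
            Console.WriteLine(name); // 20170323110427785_3741647.edi
    }
}
```

Silently skipping unmatched lines hides genuinely unknown formats, so logging them may be a better fit than `continue` in production code.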

Edit: Based on your comment, you simply need to remove one of the \s+\w+ (I've updated all the above to reflect that).

vallentin
  • Yes, but I wanted regex which will handle output specifically as I give in my post, your regex won't parse it. Sample: `-rw-rw-rw- 1 generic 235 Mar 22 11:21 fromDoder/DOD997ABCD.20170322112114159.1961812284.txt` – katit Mar 22 '17 at 17:47
  • @katit Take a look at my update. It also matches the sample now. – vallentin Mar 22 '17 at 18:04
  • How do I parse out "Folder" part in this regex? – katit Mar 22 '17 at 19:39
  • @katit If you insist on using regex, then you could [do something like `(^[^\/]*)`](https://regex101.com/r/Te46gJ/1). However it would probably be better to utilize [`new FileInfo(path).Directory.FullName`](https://msdn.microsoft.com/en-us/library/system.io.path.getdirectoryname.aspx). – vallentin Mar 23 '17 at 08:16
  • Following string is not parsed using this regex, can you assist with proper modification? :) `-rw-rw-rw- 1 cmuser cmuser 904 Mar 23 15:04 20170323110427785_3741647.edi` – katit Mar 23 '17 at 15:07
  • @katit The first regex would have parsed that. I have edited the answer to accommodate both. Check out the live preview to see working! :) – vallentin Mar 23 '17 at 15:11
  • How about this?? :) I got output like this from one server: I added "EDIT 2" - issue with first line "total 0" - would be ideal if regex can ignore those – katit Mar 23 '17 at 16:12
  • @katit Now it ignores `.` and `..` but not something like `.cache` (assuming that's what you want). Check the example if you want to see it working! – vallentin Mar 23 '17 at 16:26
  • Actually . and .. work as "folder" so fine with me. `total 0` on first line is what causing issues with specific FTP server – katit Mar 23 '17 at 16:28
  • `total 0` shouldn't break the regex considering the `RegexOptions.Multiline` flag. So just to be sure, is there still a problem? – vallentin Mar 23 '17 at 16:29
  • I don't use it as you have in example. I break lines myself in C# and then regex each line separately. In this case when I apply regex to "line 0" it fails to match, For what I'm doing I would be OK with matching it as directory. Right now we need list of files only, don't care about directories at all – katit Mar 23 '17 at 16:31
  • You can use `(?!(?:\.|\.\.)\s*$)` (as in my edit) to make it ignore `.` and `..`. However, if you want to ignore directories, I would recommend using [`File.GetAttributes`](http://stackoverflow.com/questions/1395205/better-way-to-check-if-a-path-is-a-file-or-a-directory/1395226#1395226), since regex can't know whether a bare name like `test` is a directory or a file. – vallentin Mar 23 '17 at 16:36
  • I tell directory or file by first char (-/d). Line "total 0" is what causing regex to fail, I wonder if it can be parsed out somehow – katit Mar 23 '17 at 16:38
  • Yes. If you want to ignore all directories, then instead of `(?<dir>[-d])` you could use `-`. [Here's a preview](https://regex101.com/r/JS3gjo/8) (haven't updated the answer). – vallentin Mar 23 '17 at 17:20
  • Hi! Trying to learn myself now, but your sample link not seem to be running now? I got another example which doesn't parse :( `-rw-r--r-- 1 ftp ftp 1260 Apr 03 2017 LEANCOR_040317094934.txt` – katit Apr 03 '17 at 20:23
  • Which part is not working? [Because it seems to work, if you take a look here](https://regex101.com/r/JS3gjo/11) – vallentin Apr 03 '17 at 21:56
  • Now it does, I guess there was issue with Regex101.com – katit Apr 03 '17 at 21:58
  • @Valentin Not sure what is the problem now.. This particular string: `-rw-r--r-- 1 ftp ftp 1079 Apr 06 2017 LEANCOR_040617084839.txt` does not match against RegEx. It DOES match on test website (your link). But does not match when I use RegEx expression in C# code – katit Apr 06 '17 at 22:39
  • I just added "EDIT 3" illustrating problem. But it works on Regex101.. Hmm – katit Apr 06 '17 at 22:44
  • It's because the extended flag is enabled on regex101. Note that the single-line pattern in my answer doesn't contain those newlines. If you want to keep the newlines, you need to enable the extended flag in C# as well, which is `RegexOptions.IgnorePatternWhitespace`. – vallentin Apr 07 '17 at 00:11
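
For the folder question raised in the comments, the folder part can also be separated from the captured `name` without another regex. The sketch below splits at the last `/` (the separator in the question's sample line), as a plain-string alternative to the regex and FileInfo suggestions; `FolderSplitDemo` and `Split` are hypothetical names.

```csharp
using System;

public static class FolderSplitDemo
{
    // Splits "folder/file" at the last '/'; Folder is empty when there is no '/'.
    public static (string Folder, string File) Split(string name)
    {
        int slash = name.LastIndexOf('/');
        return slash < 0
            ? (string.Empty, name)
            : (name.Substring(0, slash), name.Substring(slash + 1));
    }

    public static void Main()
    {
        var (folder, file) = Split("fromDoder/DOD997ABCD.20170322112114159.1961812284.txt");
        Console.WriteLine(folder); // fromDoder
        Console.WriteLine(file);   // DOD997ABCD.20170322112114159.1961812284.txt
    }
}
```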