In a C# program, I want to write a file to a folder where other files may already exist. If the name is taken, a suffix should be added: myfile.docx, myfile (1).docx, myfile (2).docx, and so on.

I'm struggling to analyse the existing file names and extract their parts.

Specifically, I use this regex: (?<base>.+?)(\((?<idx>\d+)\)?)?(?<ext>(\.[\w\.]+)).

This regex outputs:

╔═══════════════════════╦══════════════╦═════╦═══════════╦═══════════════════════════════════╗
║    Source Filename    ║     base     ║ idx ║ extension ║              Success              ║
╠═══════════════════════╬══════════════╬═════╬═══════════╬═══════════════════════════════════╣
║ somefile.docx         ║ somefile     ║     ║ .docx     ║ Yes                               ║
║ somefile              ║              ║     ║           ║ No, base should be "somefile"     ║
║ somefile (6)          ║              ║     ║           ║ No, base should be "somefile (6)" ║
║ somefile (1).docx     ║ somefile     ║   1 ║ .docx     ║ Yes                               ║
║ somefile (2)(1).docx  ║ somefile (2) ║   1 ║ .docx     ║ Yes                               ║
║ somefile (4).htm.tmpl ║ somefile     ║   4 ║ .htm.tmpl ║ Yes                               ║
╚═══════════════════════╩══════════════╩═════╩═══════════╩═══════════════════════════════════╝

As you can see, every case works except when the file name has no extension.

How can I fix my regex to handle the failing cases?

Reproduction: https://regex101.com/r/q9uQii/1

If it matters, here is the relevant C# code:

private static readonly Regex g_fileNameAnalyser = new Regex(
    @"(?<base>.+?)(\((?<idx>\d+)\)?)?(?<ext>(\.[\w\.]+))", 
    RegexOptions.Compiled | RegexOptions.ExplicitCapture
    );

...

var candidateMatch = g_fileNameAnalyser.Match(somefilename);
var candidateInfo = new
{
    baseName = candidateMatch.Groups["base"].Value.Trim(),
    idx = candidateMatch.Groups["idx"].Success ? int.Parse(candidateMatch.Groups["idx"].Value) : 0,
    ext = candidateMatch.Groups["ext"].Value
};
Steve B
  • Why don't you just check for file existence in a loop incrementing index by one until you find available file name? – Peter Wolf Dec 10 '19 at 21:34
  • Use `^(?<base>.+?)(?:\((?<idx>\d+)\))?(?<ext>\.[\w.]+)?$` or `^(?<base>.+?)\s*(?:\((?<idx>\d+)\))?(?<ext>\.[\w.]+)?$`, see [demo](https://regex101.com/r/aeeXLj/1) – Wiktor Stribiżew Dec 10 '19 at 21:38
  • @PeterWolf: it may be a solution. I'll consider it – Steve B Dec 10 '19 at 21:50
  • @WiktorStribiżew: your regex seems to work as expected. you should post an answer – Steve B Dec 10 '19 at 21:52
  • For the last example, note that the OS considers `.tmpl` the extension and `somefile (4).htm` the file name, see this [fiddle](https://dotnetfiddle.net/34KGGv). If you can adjust your requirement, I would stay with the OS meaning of "extension": strip the extension with `Path.GetFileNameWithoutExtension`, then parse the resulting string for the (optional) index at the end – Gian Paolo Dec 11 '19 at 00:18
  • thx @GianPaolo for the warning. My business requirement is to consider each *dot something* to be part of the extension so I'll keep it as is. But you are probably right in a more general way. I must be vigilant with some edge cases like "some.files.with.dots (3).xml" and "some.files.with.dots.xml" that don't return the same basename – Steve B Dec 11 '19 at 08:07
  • @SteveB, I'm surely biased since I quite hate RegExp [_Some people, when confronted with a problem, think "I know, I'll use regular expressions". Now they have two problems_](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/). It sounds to me that it should be easier to work it out without a regex, but as I said, I'm surely biased – Gian Paolo Dec 11 '19 at 11:19
  • @GianPaolo: the right tool for the right job. Not always easy to properly answer :) – Steve B Dec 11 '19 at 12:11

2 Answers

One option is to let the base group consume any repeated (digits) parts, using a lookahead to assert that one more (digits) pair still follows. That last pair is then captured as the idx group.

Make the idx group and the ext group optional using a question mark.

^(?<base>[^\r\n.()]+(?:(?:\(\d+\))*(?=\(\d+\)))?)(?:\((?<idx>\d+)\))?(?<ext>(?:\.[\w\.]+))?$
  • ^ Start of string
  • (?<base> Start base group
    • [^\r\n.()]+ Match 1+ times any char except the listed
    • (?: Non capturing group
      • (?:\(\d+\))*(?=\(\d+\)) Repeat matching (digits) until there is 1 (digits) part left at the right
    • )? Close group and make it optional
  • ) End base group
  • (?:\((?<idx>\d+)\))? Optional part to match idx group between ( and )
  • (?<ext>(?:\.[\w\.]+))? Optional ext group
  • $ End of string

Regex demo
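
To show how this pattern could be plugged into the C# from the question, here is a minimal sketch. The class name `LookaheadPatternSketch` and the `TryParse` helper are hypothetical names introduced for illustration; only the pattern itself comes from this answer. It assumes that a missing idx group should be treated as index 0, as the question's code does.

```csharp
using System.Text.RegularExpressions;

public static class LookaheadPatternSketch
{
    // Pattern copied verbatim from the answer above.
    private static readonly Regex Analyser = new Regex(
        @"^(?<base>[^\r\n.()]+(?:(?:\(\d+\))*(?=\(\d+\)))?)(?:\((?<idx>\d+)\))?(?<ext>(?:\.[\w\.]+))?$",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture);

    // Hypothetical helper: returns false when the pattern does not match at all,
    // or when the captured digits do not fit in an int.
    public static bool TryParse(string fileName, out string baseName, out int idx, out string ext)
    {
        baseName = ext = string.Empty;
        idx = 0;

        Match m = Analyser.Match(fileName);
        if (!m.Success)
            return false;

        // The base group keeps the space before "(n)", so trim it as the question does.
        baseName = m.Groups["base"].Value.Trim();
        ext = m.Groups["ext"].Value;

        // No idx group means an unsuffixed name: keep index 0, like the question's code.
        return !m.Groups["idx"].Success || int.TryParse(m.Groups["idx"].Value, out idx);
    }
}
```

With this sketch, "somefile (2)(1).docx" yields base "somefile (2)", idx 1 and ext ".docx". A name such as "a(b).txt" is rejected, because the base group excludes dots and parentheses, so it is worth checking the return value before using the parts.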

The fourth bird

You may use

^(?<base>.+?)\s*(?:\((?<idx>\d+)\))?(?<ext>\.[\w.]+)?$

See the regex demo.

Pattern details

  • ^ - start of string
  • (?<base>.+?) - Group "base": any 1 or more chars other than newline, as few as possible
  • \s* - 0+ whitespaces
  • (?:\((?<idx>\d+)\))? - an optional sequence of:
    • \( - a ( char
    • (?<idx>\d+) - Group "idx": 1+ digits
    • \) - a ) char
  • (?<ext>\.[\w.]+)? - an optional Group "ext":
    • \. - a . char
    • [\w.]+ - 1+ letters, digits, _ or . chars
  • $ - end of string.
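
As a rough illustration of how the parsed parts might feed the original goal (picking the next free suffix), here is a sketch built on the pattern above. The class `FileNameSuffixSketch` and both methods are hypothetical. It also makes assumptions that are not stated in the question: names are compared case-insensitively, and the next index is the highest existing index plus one, even if lower numbers are unused.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FileNameSuffixSketch
{
    // Pattern copied from the answer above.
    private static readonly Regex Analyser = new Regex(
        @"^(?<base>.+?)\s*(?:\((?<idx>\d+)\))?(?<ext>\.[\w.]+)?$",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture);

    // Splits a name into base, index (0 when there is no "(n)" part) and extension.
    // int.Parse is left unguarded for brevity and throws on digit runs too long for an int.
    public static (string BaseName, int Index, string Ext) Parse(string fileName)
    {
        Match m = Analyser.Match(fileName);
        return (
            m.Groups["base"].Value,
            m.Groups["idx"].Success ? int.Parse(m.Groups["idx"].Value) : 0,
            m.Groups["ext"].Value);
    }

    // Assumption: returns the desired name unchanged when no existing name shares
    // its base and extension, otherwise "base (max + 1)ext".
    public static string NextFreeName(IEnumerable<string> existingNames, string desiredName)
    {
        var wanted = Parse(desiredName);

        List<int> taken = existingNames
            .Select(n => Parse(n))
            .Where(p => string.Equals(p.BaseName, wanted.BaseName, StringComparison.OrdinalIgnoreCase)
                     && string.Equals(p.Ext, wanted.Ext, StringComparison.OrdinalIgnoreCase))
            .Select(p => p.Index)
            .ToList();

        return taken.Count == 0
            ? desiredName
            : $"{wanted.BaseName} ({taken.Max() + 1}){wanted.Ext}";
    }
}
```

Given existing names "myfile.docx" and "myfile (1).docx", asking for "myfile.docx" produces "myfile (2).docx". Note that with this pattern "somefile (6)" parses as base "somefile" with idx 6, so whether that matches your requirement for extensionless names is worth checking.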
Wiktor Stribiżew