Is there a regular expression to match the some.prefix part of both of the following filenames?

  • xyz can be any character of [a-z0-9-_\ ]
  • some.prefix part can be any character in [a-zA-Z0-9-_\.\ ].

I intentionally included a . in some.prefix.

some.prefix.xyz.xyz
some.prefix.xyz

I have tried many combinations. For example:

(?P<prefix>[a-zA-Z0-9-_\.]+)(?:\.[a-z0-9]+\.gz|\.[a-z0-9]+)

It works with abc.def.csv, capturing abc.def, but fails to capture the same prefix in abc.def.csv.gz.

I primarily use Python, but I thought the regex itself should apply to many languages.

Update: It's not possible, see discussion with @nowox below.

zyxue
  • Removing anything after last `.` will give filename. Replace `\.[^.]+$`. – Tushar Apr 26 '16 at 02:57
  • I forgot to add that the prefix part can contain `\.`, too. Now added. I wonder if it's possible at all to get such a regex, vaguely remember regex is greedy in a way. – zyxue Apr 26 '16 at 03:00
  • Check this out http://stackoverflow.com/questions/5899497/checking-file-extension – Robert Apr 26 '16 at 03:03
  • @Robert, I understand it can be done with Python, but for this particular question, I need a regex string, it's used by some other Python function, so it's not really about Python. – zyxue Apr 26 '16 at 03:06

2 Answers

I think your regex works pretty well. I recommend trying regex101 with your example:

https://regex101.com/r/dV6cE8/3

The expression

^(?i)[ \w-]+\.[ \w-]+

should work in your case:

som e.prefix.xyz.xyz
^^^^^^^^^^^^
some.prefix.xyz
^^^^^^^^^^^
abc.def.csv.gz
^^^^^^^

And in Python you can use:

import re

text = """som e.prefix.xyz.xyz
some.prefix.xyz
abc.def.csv.gz"""

# Python 3.11+ requires the global (?i) flag at the very start of the pattern.
print(re.findall(r'(?i)^[ \w-]+\.[ \w-]+', text, re.MULTILINE))

Which will display:

['som e.prefix', 'some.prefix', 'abc.def']

I think you may be a bit confused about your requirement. To summarize, you have a pathname made of characters and dots, such as:

foo.bar.baz.0
foobar.tar.gz
f.o.o.b.a.r

How would you separate these strings into a basename and an extension? We recognize some known patterns: .tar.gz is definitely an extension, but is .bar.baz.0 the extension, or is it only .0?

The answer is not easy, and no regex in the world can guess the correct split 100% of the time without some hints.
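
To make the ambiguity concrete, here is a minimal Python sketch. The two helper names are hypothetical and exist only for illustration. It splits the same filenames under two equally plausible rules: treat only the last dot-separated component as the extension, or treat everything after the first dot as the extension.

```python
def split_last_dot(name):
    """Rule A: only the final dot-separated component is the extension."""
    base, dot, ext = name.rpartition(".")
    # rpartition returns ('', '', name) when there is no dot at all.
    return (base, dot + ext) if dot else (name, "")


def split_first_dot(name):
    """Rule B: everything after the first dot is the extension."""
    base, dot, ext = name.partition(".")
    return (base, dot + ext)


for name in ["foo.bar.baz.0", "foobar.tar.gz", "f.o.o.b.a.r"]:
    print(name, split_last_dot(name), split_first_dot(name))
```

Both rules are defensible, yet they disagree on every one of these names. Which one is right depends on information the filename alone does not carry.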

For example, you can list the acceptable extensions and define some criteria:

  • An extension matches the regex \.\w{1,4}$
  • Several extensions may be concatenated together: (\.\w{1,4}){1,4}$
  • The remainder is called the basename

From this you can build this regular expression:

(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$
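
As a rough sketch, this expression can be applied in Python with re.match, reading back the two named groups. The helper name split_name and the sample filenames are illustrative assumptions only. Keep in mind that \w{1,4} is a heuristic, so the result depends on how long each dot-separated part happens to be.

```python
import re

# The pattern from the answer: a lazy basename followed by one to four
# short (1-4 word characters) dot-separated components at the end.
PATTERN = re.compile(r"(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$")


def split_name(filename):
    """Return (basename, extension), or None when nothing looks like an extension."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    return m.group("basename"), m.group("extension")


for name in ["some.prefix.xyz.xyz", "some.prefix.xyz", "foobar.tar.gz", "abc.def.csv.gz"]:
    print(name, "->", split_name(name))
```

With these inputs, some.prefix.xyz.xyz and some.prefix.xyz both yield the basename some.prefix, because prefix is too long to pass as an extension. abc.def.csv.gz, however, splits as abc plus .def.csv.gz, since def is short enough to look like one. That is the limitation described earlier: without extra hints the pattern cannot tell that def belongs to the name.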
nowox
  • In the first example, I would just like to match `some.prefix`, no `xyz`. In the third example, only `abc.def`, no `csv`. – zyxue Apr 26 '16 at 15:57
  • That seems to work. Is it possible to allow spaces, for example `some.pre fix.xyz.xyz`? – zyxue Apr 26 '16 at 16:04
  • I see, I wasn't being super clear in my question. What if the filename is `some.prefix.2.csv.gz`, can your regex match `some.other.prefix.2`? Similarly, there could be `some.prefix.2.3`, `some.prefix.2.3.4` etc. I only want the last one or two exts to be excluded after matching. – zyxue Apr 26 '16 at 16:16
  • The main question is how to differentiate what belongs to the file name from what belongs to the extension. What do you want to keep in this example: `a.b.c.d.e.f.g`? – nowox Apr 26 '16 at 16:18
  • I see your point. So it's not possible to come up with such a regex. Can you add this to your answer, and I'll mark it as accepted. – zyxue Apr 26 '16 at 16:19

Try this:

[a-z0-9-_\\]+\.[a-z0-9-_\\]+[a-zA-Z0-9-_\.\\]+

AJ333
  • It worked for me. Just to be clear, you want both results to match? – AJ333 Apr 26 '16 at 03:18
  • I want it to be able to extract the `some.prefix` part of a filename. For example, if the filename is `abc.def.csv.gz`, it should be able to extract `abc.def` from the matched results. You probably need some grouping in your regex string (e.g. with parentheses). – zyxue Apr 26 '16 at 03:25