I think your regex works pretty well. I recommend trying it on regex101 with your example:
https://regex101.com/r/dV6cE8/3
The expression
^(?i)[ \w-]+\.[ \w-]+
should work in your case:
som e.prefix.xyz.xyz
^^^^^^^^^^^^
some.prefix.xyz
^^^^^^^^^^^
abc.def.csv.gz
^^^^^^^
And in Python you can use:
import re

text = """som e.prefix.xyz.xyz
some.prefix.xyz
abc.def.csv.gz"""

# (?i) goes first: Python 3.11+ rejects inline global flags placed mid-pattern.
print(re.findall(r'(?i)^[ \w-]+\.[ \w-]+', text, re.MULTILINE))
Which will display:
['som e.prefix', 'some.prefix', 'abc.def']
I think you may be a bit confused about your requirement. To summarize, you have a pathname made of characters and dots, such as:
foo.bar.baz.0
foobar.tar.gz
f.o.o.b.a.r
How would you separate these strings into a basename and an extension? Some patterns are easy to recognize: .tar.gz is definitely an extension. But is .bar.baz.0 the extension, or is it only .0?
The answer is not easy, and no regex in the world can guess the correct split every time without some hints.
For example, you can list the acceptable extensions and define some criteria:
- An extension matches the regex \.\w{1,4}$
- Several extensions may be concatenated together: (\.\w{1,4}){1,4}$
- The remainder is called the basename
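The first hint, listing the acceptable extensions, needs no regex at all. Here is a minimal sketch of that idea; KNOWN_EXTENSIONS and split_known are names I made up for illustration, and the list itself is only an example that you would adapt to your files:

```python
# Hypothetical list of accepted extensions; adapt it to your own data.
KNOWN_EXTENSIONS = (".tar.gz", ".csv.gz", ".gz", ".csv", ".xyz")

def split_known(name):
    """Split name into (basename, extension) using the known list.

    Longer extensions are tried first so that '.tar.gz' wins over '.gz'.
    If no known extension matches, the extension is the empty string.
    """
    for ext in sorted(KNOWN_EXTENSIONS, key=len, reverse=True):
        # Require a non-empty basename in front of the extension.
        if name.endswith(ext) and len(name) > len(ext):
            return name[:-len(ext)], ext
    return name, ""

for name in ("foobar.tar.gz", "abc.def.csv.gz", "foo.bar.baz.0"):
    print(name, "->", split_known(name))
```

With this approach an unknown suffix such as .0 is simply left in the basename, which may or may not be what you want.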
From these criteria you can build this regular expression:
(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$
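To see how this expression behaves on the three names from above, here is a small sketch. The helper name split_name is mine, and returning an empty extension when nothing matches is an assumption about what you would want:

```python
import re

# A lazy basename followed by one to four dot-separated chunks of
# one to four word characters, anchored at the end of the string.
PATTERN = re.compile(r'(?P<basename>.*?)(?P<extension>(?:\.\w{1,4}){1,4})$')

def split_name(name):
    """Return (basename, extension); extension is '' when nothing matches."""
    m = PATTERN.match(name)
    if m is None:
        return name, ''
    return m.group('basename'), m.group('extension')

for name in ('foo.bar.baz.0', 'foobar.tar.gz', 'f.o.o.b.a.r', 'README'):
    print(name, '->', split_name(name))
```

This gives ('foo', '.bar.baz.0'), ('foobar', '.tar.gz'), ('f.o', '.o.b.a.r') and ('README', ''). Note that the whole of .bar.baz.0 is taken as the extension, which is exactly the ambiguity described above. If that is too greedy for your data, tighten the {1,4} quantifiers or fall back to an explicit list of extensions.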