
I'm trying to write a Python library to parse our version format strings. The (simplified) version string format is as follows:

<product>-<x>.<y>.<z>[-alpha|beta|rc[.<n>][.<extra>]][.centos|redhat|win][.snb|ivb]

This is:

  • product, e.g. foo
  • numeric version, e.g. 0.1.0
  • [optional] pre-release info, e.g. beta, rc.1, alpha.extrainfo
  • [optional] operating system, e.g. centos
  • [optional] platform, e.g. snb, ivb

So the following are valid version strings:

1) foo-1.2.3
2) foo-2.3.4-alpha
3) foo-3.4.5-rc.2
4) foo-4.5.6-rc.2.extra
5) withos-5.6.7.centos
6) osandextra-7.8.9-rc.extra.redhat
7) all-4.4.4-rc.1.extra.centos.ivb

For all of those examples, the following regex works fine:

^(?P<prod>\w+)-(?P<maj>\d).(?P<min>\d).(?P<bug>\d)(?:-(?P<pre>alpha|beta|rc)(?:\.(?P<pre_n>\d))?(?:\.(?P<pre_x>\w+))?)?(?:\.(?P<os>centos|redhat|win))?(?:\.(?P<plat>snb|ivb))?$

The problem appears with versions like the following, which have no 'extra' pre-release information but do have an OS and/or platform:

8) issue-0.1.0-beta.redhat.snb

With the above regex, for string #8, redhat is captured by the pre-release extra group pre_x instead of the os group.
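The misassignment can be reproduced with the standard `re` module. This is a minimal sketch that copies the pattern from the question unchanged; the names `ORIGINAL` and `groups` are only illustrative.

```python
import re

# Pattern copied verbatim from the question (before any fix).
ORIGINAL = re.compile(
    r"^(?P<prod>\w+)-(?P<maj>\d).(?P<min>\d).(?P<bug>\d)"
    r"(?:-(?P<pre>alpha|beta|rc)(?:\.(?P<pre_n>\d))?(?:\.(?P<pre_x>\w+))?)?"
    r"(?:\.(?P<os>centos|redhat|win))?(?:\.(?P<plat>snb|ivb))?$"
)

groups = ORIGINAL.match("issue-0.1.0-beta.redhat.snb").groupdict()
# pre_x greedily takes "redhat", so the optional os group is left empty.
print(groups["pre_x"], groups["os"], groups["plat"])  # redhat None snb
```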

I tried using look-behind to avoid picking the os or platform strings in pre_x:

...(?:\.(?P<pre_x>\w+))?(?<!centos|redhat|win|ivb|snb))...

That is:

^(?P<prod>\w+)-(?P<maj>\d).(?P<min>\d).(?P<bug>\d)(?:-(?P<pre>alpha|beta|rc)(?:\.(?P<pre_n>\d))?(?:\.(?P<pre_x>\w+))?(?<!centos|redhat|win|ivb|snb))?(?:\.(?P<os>centos|redhat|win))?(?:\.(?P<plat>snb|ivb))?$

This would work fine if Python's standard re module accepted variable-width look-behind. I would rather stick to the standard module than use the third-party regex module, as my library is quite likely to be distributed to a large number of machines, where I want to limit dependencies.
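The limitation is easy to check. The sketch below uses a hypothetical helper, `compiles`, to show that `re` rejects the look-behind with alternatives of different lengths, while one look-behind per word is accepted because each is fixed-width.

```python
import re

def compiles(pattern):
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of 6 and 3 characters make this look-behind variable-width.
print(compiles(r"(?<!centos|redhat|win|ivb|snb)"))  # False
# One look-behind per word is fixed-width each time.
print(compiles(r"(?<!centos)(?<!redhat)(?<!win)(?<!ivb)(?<!snb)"))  # True
```

The stacked form is not a drop-in equivalent of the intended check: placed after pre_x, it would also reject extra info that merely ends in one of those words, such as darwin.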

I've also had a look at similar questions: this, this and this are not applicable.

Any ideas on how to achieve this?

My regex101 link: https://regex101.com/r/bH0qI7/3

[For those interested, this is the full regex I'm actually using: https://regex101.com/r/lX7nI6/2]

Xabs
  • Could transforming your regex to use lookaheads help with anything? – rr- May 27 '15 at 13:47
  • Yes, I don't mind using lookaheads, I just want to stick to regex and the standard `re` module. TBH, I'm lost transforming this to lookaheads. – Xabs May 27 '15 at 13:49

2 Answers


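You need a negative lookahead assertion so that (?P<pre_x>\w+) cannot start on one of the OS or platform words (centos, redhat, win, snb, ivb). The \b after the alternation keeps longer words such as windows usable as extra info:

^(?P<prod>\w+)-(?P<maj>\d)\.(?P<min>\d)\.(?P<bug>\d)(?:-(?P<pre>alpha|beta|rc)(?:\.(?P<pre_n>\d))?(?:\.(?!(?:centos|redhat|win|snb|ivb)\b)(?P<pre_x>\w+))?)?(?:\.(?P<os>centos|redhat|win))?(?:\.(?P<plat>snb|ivb))?$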

DEMO
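For use from Python, a lookahead-based pattern can be spelled out with `re.VERBOSE`. This is only a sketch: the names `VERSION_RE` and `parse` are illustrative, and the lookahead lists all five OS and platform words followed by `\b`, which is an assumption about what should be kept out of pre_x.

```python
import re

VERSION_RE = re.compile(r"""
    ^(?P<prod>\w+)-
    (?P<maj>\d)\.(?P<min>\d)\.(?P<bug>\d)
    (?:-(?P<pre>alpha|beta|rc)
        (?:\.(?P<pre_n>\d))?
        # Refuse to start the extra token on a whole OS or platform word.
        (?:\.(?!(?:centos|redhat|win|snb|ivb)\b)(?P<pre_x>\w+))?
    )?
    (?:\.(?P<os>centos|redhat|win))?
    (?:\.(?P<plat>snb|ivb))?$
""", re.VERBOSE)

def parse(text):
    """Return the named groups that matched, or None if the string is invalid."""
    match = VERSION_RE.match(text)
    if match is None:
        return None
    return {k: v for k, v in match.groupdict().items() if v is not None}

for sample in ("issue-0.1.0-beta.redhat.snb", "all-4.4.4-rc.1.extra.centos.ivb"):
    print(sample, parse(sample))
```

With this pattern, string #8 yields os redhat and plat snb, and pre_x stays unset.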

Avinash Raj
  • That was easy!! Thanks a lot – Xabs May 27 '15 at 14:01
  • A nitpick, @Avinash Raj: shouldn't it be \d+ in maj/min/bug? I followed the link in the demo and tried, say, foo-3.4.15-rc.2, which doesn't match. (With release early and release fast! ;-) it's not too hard to have version numbers that go into two digits, if not three. :-) – gabhijit May 27 '15 at 14:52
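The difference between \d and \d+ raised in this comment can be seen on a reduced example. The two patterns below are hypothetical cut-down versions (numeric part plus an optional rc number only), not the full regex from the answer.

```python
import re

# Hypothetical reduced patterns: single-digit versus multi-digit components.
SINGLE = re.compile(r"^(?P<prod>\w+)-(?P<maj>\d)\.(?P<min>\d)\.(?P<bug>\d)(?:-rc\.(?P<pre_n>\d))?$")
MULTI = re.compile(r"^(?P<prod>\w+)-(?P<maj>\d+)\.(?P<min>\d+)\.(?P<bug>\d+)(?:-rc\.(?P<pre_n>\d+))?$")

sample = "foo-3.4.15-rc.2"
print(SINGLE.match(sample))              # None: "15" is two digits
print(MULTI.match(sample).group("bug"))  # 15
```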

Actually I'd avoid using the regex, since it looks pretty horrible already, and you told us it's only simplified. It's much more readable to parse it by hand:

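def extract(text):
    # Expected shape: <name>-<version>[-<pre-release tokens>[.<os>][.<platform>]]
    # An OS or platform attached directly to the version (no second '-')
    # is not split off here and stays in 'version'.
    parts = text.split('-')
    ret = {}
    ret['name'] = parts.pop(0)
    ret['version'] = parts.pop(0).split('.')

    if parts:
        rest_parts = parts.pop(0).split('.')
        # Platform, then OS, are the trailing tokens when present.
        if rest_parts and rest_parts[-1] in ('snb', 'ivb'):
            ret['platform'] = rest_parts.pop()
        if rest_parts and rest_parts[-1] in ('redhat', 'centos', 'win'):
            ret['os'] = rest_parts.pop()
        # Whatever is left is the pre-release information.
        ret['extra'] = rest_parts

    return ret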

tests = [
    'foo-1.2.3',
    'foo-2.3.4-alpha',
    'foo-3.4.5-rc.2',
    'foo-4.5.6-rc.2.extra',
    'withos-5.6.7.centos',
    'osandextra-7.8.9-rc.extra.redhat',
    'all-4.4.4-rc.1.extra.centos.ivb',
    'issue-0.1.0-beta.redhat.snb',
]

for test in tests:
    print(test, extract(test))

Result (as printed by Python 2, where print shows the pair as a tuple):

('foo-1.2.3', {'version': ['1', '2', '3'], 'name': 'foo'})
('foo-2.3.4-alpha', {'version': ['2', '3', '4'], 'name': 'foo', 'extra': ['alpha']})
('foo-3.4.5-rc.2', {'version': ['3', '4', '5'], 'name': 'foo', 'extra': ['rc', '2']})
('foo-4.5.6-rc.2.extra', {'version': ['4', '5', '6'], 'name': 'foo', 'extra': ['rc', '2', 'extra']})
('withos-5.6.7.centos', {'version': ['5', '6', '7', 'centos'], 'name': 'withos'})
('osandextra-7.8.9-rc.extra.redhat', {'version': ['7', '8', '9'], 'os': 'redhat', 'name': 'osandextra', 'extra': ['rc', 'extra']})
('all-4.4.4-rc.1.extra.centos.ivb', {'platform': 'ivb', 'version': ['4', '4', '4'], 'os': 'centos', 'name': 'all', 'extra': ['rc', '1', 'extra']})
('issue-0.1.0-beta.redhat.snb', {'platform': 'snb', 'version': ['0', '1', '0'], 'os': 'redhat', 'name': 'issue', 'extra': ['beta']})
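
In the result above, withos-5.6.7.centos leaves centos inside version, because the OS is only looked for after a second '-'. One possible adjustment, sketched here under the assumption that the platform and OS are always the last dot-separated tokens, peels them off before splitting on '-'. The name `extract_suffix_first` is hypothetical.

```python
OS_NAMES = ("redhat", "centos", "win")
PLATFORMS = ("snb", "ivb")

def extract_suffix_first(text):
    ret = {}
    tokens = text.split(".")
    # Assumption: platform, then OS, are the trailing tokens when present.
    if len(tokens) > 1 and tokens[-1] in PLATFORMS:
        ret["platform"] = tokens.pop()
    if len(tokens) > 1 and tokens[-1] in OS_NAMES:
        ret["os"] = tokens.pop()
    # What remains is <name>-<x>.<y>.<z>[-<pre-release tokens>].
    parts = ".".join(tokens).split("-")
    ret["name"] = parts.pop(0)
    ret["version"] = parts.pop(0).split(".")
    if parts:
        ret["extra"] = parts.pop(0).split(".")
    return ret

print(extract_suffix_first("withos-5.6.7.centos"))
```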
rr-
  • Thanks, this indeed looks much cleaner, but if you start adding more complexity it gets dirty pretty quickly, e.g. extending it to allow a more flexible version format like `foo_1.2.3`, `foo1.2.3`, `foo.1.2.3`, or even a missing bugfix number: `foo-1.2` (=`foo-1.2.0`)... Keep doing this for every token and you'll end up with a massive piece of code that is even more difficult to debug than a regex – Xabs May 27 '15 at 14:27
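
For comparison, the looser formats mentioned in this comment can be absorbed by a small change to the front of a regex. The sketch below is hypothetical and covers only the product and numeric part; it assumes the product name consists of letters only, so that `foo1.2.3` can be split, and the names `FLEX_RE` and `parse_flexible` are illustrative.

```python
import re

# Hypothetical pattern: letters-only product, optional separator, optional bugfix number.
FLEX_RE = re.compile(
    r"^(?P<prod>[A-Za-z]+)[-_.]?(?P<maj>\d+)\.(?P<min>\d+)(?:\.(?P<bug>\d+))?$"
)

def parse_flexible(text):
    match = FLEX_RE.match(text)
    if match is None:
        return None
    info = match.groupdict()
    # A missing bugfix number is treated as 0, so foo-1.2 equals foo-1.2.0.
    info["bug"] = info["bug"] or "0"
    return info

for sample in ("foo_1.2.3", "foo1.2.3", "foo.1.2.3", "foo-1.2"):
    print(sample, parse_flexible(sample))
```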