Single regular expression in Python with named groups for interleaved text

Question

I would like to create a single regular expression in Python that extracts two interleaved portions of text from a filename as named groups. An example filename is given below:

CM00626141_H12.d4_T0001F003L01A02Z03C02.tif

The part of the filename I'd like to extract is contained between the underscores, and consists of the following:

An uppercase letter: [A-H]
A zero-padded two-digit number: 01 to 12
A period
A lowercase letter: [a-d]
A single digit: 1 to 4

For the example above, I would like one group ('Row') to contain H.d, and the other group ('Column') to contain 12.4. However, I don't know how to do this when the text is separated as it is here.

EDIT: A constraint which I omitted: it needs to be a single regex to handle the string. I've updated the text/title to reflect this point.

Do any of the answers solve your problem? If so, please mark it as the answer (green checkmark below the votes)... — nozzleman, Nov 10 '16 at 09:52
I didn't mention that I need a single regex to cover this; I've updated the text accordingly. Hence, since @jasonharper indicates it's not doable, that's my accepted answer. — braymp, Nov 10 '16 at 15:35
Is it possible to use positive lookbehind and positive lookahead to do the job? E.g., http://stackoverflow.com/questions/277547/regular-expression-to-skip-character-in-capture-group — braymp, Nov 10 '16 at 15:41

Moinuddin Quadri · Answer 1 · 2016-11-09T20:31:01.860

You can do it in two steps using re.findall():

Step 1: Extract the four components from the filename with a pattern that follows your format:

>>> import re

>>> my_file = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> my_content = re.findall(r'_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_', my_file)
# my_content is now [('H', '12', 'd', '4')]; 0[1-9]|1[0-2] limits the number to 01-12

Step 2: Join alternating elements of the tuple to get the row and column values:

>>> row = ".".join(my_content[0][::2])
>>> row
'H.d'

>>> column = ".".join(my_content[0][1::2])
>>> column
'12.4'
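
The two steps above can also be wrapped in a small helper. This is only a sketch: the names PLATE_RE and parse_well and the four group names are made up for illustration, and it assumes the number must be 01 to 12 as stated in the question. It still needs four groups in the regex; the two values are assembled afterwards.

```python
import re

# Hypothetical names (PLATE_RE, parse_well, group names) chosen for illustration.
PLATE_RE = re.compile(
    r'_(?P<row_major>[A-H])(?P<col_major>0[1-9]|1[0-2])'
    r'\.(?P<row_minor>[a-d])(?P<col_minor>[1-4])_'
)


def parse_well(filename):
    """Return {'Row': ..., 'Column': ...}, or None if the pattern is absent."""
    m = PLATE_RE.search(filename)
    if m is None:
        return None
    # The interleaved pieces are recombined here, outside the regex.
    return {
        'Row': '.'.join((m.group('row_major'), m.group('row_minor'))),
        'Column': '.'.join((m.group('col_major'), m.group('col_minor'))),
    }


print(parse_well('CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'))
# {'Row': 'H.d', 'Column': '12.4'}
```

Returning None for a filename that does not match avoids the IndexError that my_content[0] would raise on an empty findall() result.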

jasonharper · Answer 2 · Accepted Answer · score 1 · answered Nov 09 '16 at 20:13

Regexp capturing groups (whether numbered or named) do not actually capture text - they capture starting/ending indices within the original text. Thus, it is impossible for them to capture non-contiguous text. Probably the best thing to do here is have four separate groups, and combine them into your two desired values manually.

Accepting this answer, since it addressed a constraint which I forgot to include in the OP. OP edited accordingly. – braymp Nov 10 '16 at 15:36
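
The point about indices in this answer can be seen with Match.span(). The sketch below uses four named groups (the names r1, c1, r2, c2 are made up for illustration) and prints the start/end pair each one holds; the row and column are then built by slicing the original string.

```python
import re

source = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
m = re.search(
    r'_(?P<r1>[A-H])(?P<c1>0[1-9]|1[0-2])\.(?P<r2>[a-d])(?P<c2>[1-4])_',
    source,
)

# Each group is just a (start, end) pair into the original string.
spans = {name: m.span(name) for name in ('r1', 'c1', 'r2', 'c2')}
print(spans)
# {'r1': (11, 12), 'c1': (12, 14), 'r2': (15, 16), 'c2': (16, 17)}

# 'H' and 'd' are not adjacent (indices 11 and 15), so no single
# group can cover both without also covering '12.' in between.
row = source[slice(*spans['r1'])] + '.' + source[slice(*spans['r2'])]
column = source[slice(*spans['c1'])] + '.' + source[slice(*spans['c2'])]
print(row, column)
# H.d 12.4
```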

user108471 · Answer 3 · 2016-11-09T20:29:13.913

I do not believe there is any way to capture everything you want in exactly two named capture groups and one regex call. The most straightforward way I see is to do the following:

>>> import re
>>> source = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> match = re.search(r'_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_', source)
>>> groups = match.groups()  # ('H', '12', 'd', '4')
>>> row, column = '.'.join(groups[0::2]), '.'.join(groups[1::2])
>>> row
'H.d'
>>> column
'12.4'

Alternatively, you might find it more appealing to handle the parsing almost completely in the regex:

>>> row, column = re.sub(
...     r'^.*_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_.*$',
...     r'\1.\3,\2.\4',
...     source).split(',')
>>> row, column
('H.d', '12.4')
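
A related variant, offered as a sketch rather than a replacement for the code above: Match.expand() fills a replacement template from an existing match, so the recombination can be written with named references. The group names r1, c1, r2, c2 are made up for illustration. Unlike re.sub(), which returns the input unchanged when nothing matches, this makes the no-match case explicit.

```python
import re

source = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
pattern = re.compile(
    r'_(?P<r1>[A-H])(?P<c1>0[1-9]|1[0-2])\.(?P<r2>[a-d])(?P<c2>[1-4])_'
)

row = column = None
match = pattern.search(source)
if match is not None:
    # \g<name> in the template refers to a named group of the match.
    row = match.expand(r'\g<r1>.\g<r2>')
    column = match.expand(r'\g<c1>.\g<c2>')

print(row, column)
# H.d 12.4
```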
