
I would like to create a single regular expression in Python that extracts two interleaved portions of text from a filename as named groups. An example filename is given below:

CM00626141_H12.d4_T0001F003L01A02Z03C02.tif

The part of the filename I'd like to extract is contained between the underscores, and consists of the following:

  • An uppercase letter: [A-H]
  • A zero-padded two-digit number: 01 to 12
  • A period
  • A lowercase letter: [a-d]
  • A single digit: 1 to 4

For the example above, I would like one group ('Row') to contain H.d and the other group ('Column') to contain 12.4. However, I don't know how to do this when the two parts are interleaved, as they are here.

EDIT: A constraint which I omitted: it needs to be a single regex to handle the string. I've updated the text/title to reflect this point.

braymp
  • Does any of the answers solve your problem? If so, please mark it as the answer (green checkmark below the votes). – nozzleman Nov 10 '16 at 09:52
  • I didn't mention that I need a single regex to cover this; I've updated the text accordingly. Since @jasonharper indicates it's not doable, that's my accepted answer. – braymp Nov 10 '16 at 15:35
  • Is it possible to use positive lookbehind and positive lookahead to do the job? E.g., http://stackoverflow.com/questions/277547/regular-expression-to-skip-character-in-capture-group – braymp Nov 10 '16 at 15:41

3 Answers


You can do it in two steps using re.findall():

Step 1: Extract the four parts of the substring that matches your pattern:

>>> import re

>>> my_file = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> my_content = re.findall(r'_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_', my_file)
>>> my_content
[('H', '12', 'd', '4')]

Step 2: Join alternating elements of the tuple to get the row and column values:

>>> row = ".".join(my_content[0][::2])
>>> row
'H.d'

>>> column = ".".join(my_content[0][1::2])
>>> column
'12.4'
Moinuddin Quadri

Regexp capturing groups (whether numbered or named) do not actually capture text - they capture starting/ending indices within the original text. Thus, it is impossible for them to capture non-contiguous text. Probably the best thing to do here is have four separate groups, and combine them into your two desired values manually.
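As a rough illustration of the point about indices, the sketch below prints the span of each group and then builds the two values by hand. The group names (row_major, col_major, row_minor, col_minor) are made up for this example, and the number part is written as 0[1-9]|1[0-2] so that only 01 to 12 is accepted.

```python
import re

filename = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'

# Four named groups; the names are hypothetical, chosen for this sketch.
pattern = re.compile(
    r'_(?P<row_major>[A-H])(?P<col_major>0[1-9]|1[0-2])'
    r'\.(?P<row_minor>[a-d])(?P<col_minor>[1-4])_'
)

match = pattern.search(filename)
if match is None:
    raise ValueError('no well identifier found in %r' % filename)

# Each group is a single (start, end) pair, i.e. one contiguous slice.
for name in ('row_major', 'col_major', 'row_minor', 'col_minor'):
    start, end = match.span(name)
    print(name, (start, end), filename[start:end])

# The interleaved pieces have to be recombined outside the regex.
parts = match.groupdict()
row = '{row_major}.{row_minor}'.format(**parts)
column = '{col_major}.{col_minor}'.format(**parts)
print(row, column)
```

Because a group can only ever report one start index and one end index, 'H' and 'd' cannot land in the same group while '12.' sits between them.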

jasonharper
  • Accepting this answer, since it addressed a constraint which I forgot to include in the OP. OP edited accordingly. – braymp Nov 10 '16 at 15:36

I do not believe there is any way to capture everything you want in exactly two named capture groups and one regex call. The most straightforward way I see is to do the following:

>>> import re
>>> source = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> match = re.search(r'_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_', source)
>>> groups = match.groups()
>>> row, column = '.'.join(groups[0::2]), '.'.join(groups[1::2])
>>> row
'H.d'
>>> column
'12.4'

Alternatively, you might find it more appealing to handle the parsing almost completely in the regex:

>>> row, column = re.sub(
...     r'^.*_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_.*$',
...     r'\1.\3,\2.\4',
...     source).split(',')
>>> row, column
('H.d', '12.4')
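One caveat with the re.sub() form: when the pattern does not match, re.sub() returns the input unchanged, so the split(',') and tuple unpacking can fail or give misleading values. A minimal sketch of a guarded variant follows; the function name row_and_column is hypothetical, and it uses Match.expand() with the same replacement template. The number part is written as 0[1-9]|1[0-2] so that only 01 to 12 is accepted.

```python
import re

# Same overall shape as the substitution pattern, compiled once.
PATTERN = re.compile(r'^.*_([A-H])(0[1-9]|1[0-2])\.([a-d])([1-4])_.*$')


def row_and_column(filename):
    """Return (row, column), or None when the filename does not match.

    Hypothetical helper for illustration only.
    """
    match = PATTERN.match(filename)
    if match is None:
        return None
    # expand() applies the replacement template to this match only.
    row, column = match.expand(r'\1.\3,\2.\4').split(',')
    return row, column


print(row_and_column('CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'))
print(row_and_column('no_well_here.tif'))
```

The first call should print ('H.d', '12.4') and the second None, which lets the caller decide how to treat filenames without a well identifier.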
user108471