
I have a sequence and a pattern with several brackets (only one level deep)

seq = "TTVGYDRTERDFSSADFTTVGYDRTERDFSSADFTTVGYDRTERDFSSADFTTVGYDRTERDFSSADF"
pattern = "(TT)V(GYD)"

Now I would like to match the pattern and get the start and end index of each bracketed part. For this example, the result should look something like:

[(0,2), (3,6), (17,19), (20, 23), (34,36), (37,40), (51,53), (54,57)]

I've played around with the re package and thought I almost had it with

[(reo.group(1).start(), reo.group(1).end()) for reo in re.finditer(pattern, seq)]

but sadly .group(1) returns only a string and not a "Match Object", so it has no .start() or .end(). Does anyone have a good idea how this could be accomplished?

Magellan88

1 Answer


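You could use the undocumented MatchObject.regs for your purpose. It appears to hold the match regions as a tuple (g0, g1, g2, ..., gn) of (start, end) pairs, where g0 is the span of the whole match and g1 to gn are the spans of the individual groups.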

import re

seq = "TTVGYDRTERDFSSADFTTVGYDRTERDFSSADFTTVGYDRTERDFSSADFTTVGYDRTERDFSSADF"
pattern = "(TT)V(GYD)"

result = []
for reo in re.finditer(pattern, seq):
    result.extend(reo.regs[1:])

Result:

[(0, 2), (3, 6), (17, 19), (20, 23), (34, 36), (37, 40), (51, 53), (54, 57)]

So reo.regs for the first match looks like this:

(Pdb) reo.regs
((0, 6), (0, 2), (3, 6))

Because you are only interested in the spans of the individual groups, we drop the first 2-tuple (the span of the whole match) and keep the rest with reo.regs[1:] (slice from index 1 to the end).
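Since regs is undocumented, it may be worth knowing that the same spans can be read through the documented span() method of the match object. The following is only a minimal sketch, assuming Python 3; the helper name group_spans is made up for illustration and is not part of re:

```python
import re

def group_spans(pattern, text):
    """Collect the (start, end) span of every capturing group of every match."""
    compiled = re.compile(pattern)
    spans = []
    for match in compiled.finditer(text):
        # Groups are numbered from 1; group 0 is the whole match.
        for index in range(1, compiled.groups + 1):
            spans.append(match.span(index))
    return spans

seq = "TTVGYDRTERDFSSADFTTVGYDRTERDFSSADFTTVGYDRTERDFSSADFTTVGYDRTERDFSSADF"
spans = group_spans("(TT)V(GYD)", seq)
print(spans)
```

Note that span() returns (-1, -1) for a group that did not take part in the match, for example with an alternation such as "(a)|(b)".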

Since ((0, 2), (3, 6)) is still a tuple, appending it would give you a nested list [((s0, e0), (s1, e1)), ((s2, e2), (s3, e3)), ...]. In order to keep the list of indices flat, I therefore extend the list instead of appending to it.
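To make the difference between the two list methods concrete, here is a small sketch on a shortened input (two repeats of the motif instead of four; the variable names nested and flat are only for illustration). It still relies on the undocumented regs attribute:

```python
import re

seq = "TTVGYDRTERDFSSADF" * 2  # shortened input, for illustration only
pattern = "(TT)V(GYD)"

nested, flat = [], []
for reo in re.finditer(pattern, seq):
    spans = reo.regs[1:]   # undocumented attribute: spans of groups 1..n
    nested.append(spans)   # adds the whole tuple as a single element
    flat.extend(spans)     # adds each (start, end) pair separately

print(nested)  # [((0, 2), (3, 6)), ((17, 19), (20, 23))]
print(flat)    # [(0, 2), (3, 6), (17, 19), (20, 23)]
```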

Lukas Graf
  • I knew that had to be doable somehow! Nice solution, thanks! Would you care to make a comment or two about this .expand function, because the help text was not so helpful... (it just said "built in method") – Magellan88 Apr 24 '14 at 22:00