Find the indexes of all regex matches?

Question

I'm parsing strings that could have any number of quoted strings inside them (I'm parsing code, and trying to avoid PLY). I want to find out whether a substring is quoted, and I have the substring's index. My initial thought was to use re to find all the matches and then figure out the range of indexes they represent.

It seems like I should use re with a regex like \"[^\"]+\"|'[^']+' (I'm not handling triple-quoted and similar strings at the moment). When I use findall() I get a list of the matching strings, which is somewhat nice, but I need indexes.

My substring might be as simple as c, and I need to figure out if this particular c is actually quoted or not.

Sounds like a job not suitable for regexes. – Daniel Kluev Aug 19 '10 at 07:19

score 223 · Accepted Answer · edited Feb 15 '13 at 13:12

This is what you want: (source)

re.finditer(pattern, string[, flags]) 
Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.

You can then get the start and end positions from the MatchObjects.

e.g.

[(m.start(0), m.end(0)) for m in re.finditer(pattern, string)]
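
As a minimal sketch of how this could answer the original question, the spans from `re.finditer` can be collected and then checked against a given index. The pattern is the one from the question (no escapes, no triple quotes); the helper names `quoted_spans` and `is_quoted` and the sample line are made up for illustration.

```python
import re

# Pattern from the question: double- or single-quoted runs, no escapes or triple quotes.
QUOTED = re.compile(r'"[^"]+"|\'[^\']+\'')

def quoted_spans(text):
    """Return (start, end) pairs for each quoted run; end is exclusive."""
    return [m.span() for m in QUOTED.finditer(text)]

def is_quoted(text, index):
    """Report whether the character at `index` lies inside any quoted run."""
    return any(start <= index < end for start, end in quoted_spans(text))

line = 'x = c + "abc" + \'c\''
print(quoted_spans(line))   # [(8, 13), (16, 19)]
print(is_quoted(line, 4))   # False: the bare c
print(is_quoted(line, 11))  # True: the c inside "abc"
```

The spans include the quote characters themselves, so an index pointing at a quote mark also counts as quoted in this sketch.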

edited Feb 15 '13 at 13:12

Balthazar Rouberol

answered Aug 19 '10 at 07:22

Dave Kirby

56

Note that you can actually use `m.span()` to get `(m.start(), m.end())` (and the default group argument is `0`, so that can be omitted). – Amber Mar 13 '11 at 05:08
1

Brilliant. Was looking for exactly this. – armandino Jul 09 '12 at 14:49
6

Attention, it fails in this case: base_str = "GATATATGCATATACTT", sub_str = "ATAT". The result should be [(1, 5), (3, 7), (9, 13)], but it turns out to be [(1, 5), (9, 13)]. – unionx Dec 07 '14 at 16:39
@unionx if you have a better solution, that is your choice. – Burger King May 12 '15 at 09:50
6

@unionx: finditer(), as per the documentation, returns non-overlapping matches. – Talia May 28 '15 at 15:26
2

A [much more recent example](https://www.tutorialspoint.com/How-do-we-use-re-finditer-method-in-Python-regular-expression), with 2018 syntax – Nathan majicvr.com Jun 10 '20 at 20:03

score 3 · Answer 2 · edited Dec 05 '22 at 21:16

To get the indices of all occurrences, including overlapping ones:

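import re

S = input()  # source string
k = input()  # pattern to search for (treated as a regex; use re.escape(k) for literal text)
pattern = re.compile(k)
r = pattern.search(S)
if not r:
    print("(-1, -1)")
while r:
    # r.end() is exclusive, so subtract 1 to print an inclusive end index
    print("({0}, {1})".format(r.start(), r.end() - 1))
    # resume one character after the previous match start so overlaps are found
    r = pattern.search(S, r.start() + 1)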
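
The same loop can be sketched as a function that returns the spans instead of printing them, which makes it easier to check against the overlapping case raised in the comments on the accepted answer. The function name `overlapping_spans` is hypothetical, and this version assumes the search term is literal text (hence `re.escape`) and reports exclusive end indices.

```python
import re

def overlapping_spans(needle, haystack):
    """Collect (start, end) spans of `needle`, allowing overlaps; end is exclusive."""
    pattern = re.compile(re.escape(needle))  # assumption: needle is literal text
    spans = []
    match = pattern.search(haystack)
    while match:
        spans.append(match.span())
        # Resume one character past the previous start so overlaps are not skipped.
        match = pattern.search(haystack, match.start() + 1)
    return spans

print(overlapping_spans("ATAT", "GATATATGCATATACTT"))  # [(1, 5), (3, 7), (9, 13)]
```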

edited Dec 05 '22 at 21:16

KyleMit

answered Aug 25 '20 at 03:02

Be Champzz

score 1 · Answer 3 · edited Dec 05 '22 at 21:17

This should solve your issue:

pattern=r"(?=(\"[^\"]+\"|'[^']+'))"

Then use the following to get all overlapping indices:

# text is the string to scan (the name input would shadow the builtin); the -1 makes the end index inclusive
indices = [(m.start(1), m.end(1) - 1) for m in re.finditer(pattern, text)]
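
A small runnable sketch of the lookahead technique, for experimenting with what it actually reports. The helper name `lookahead_spans` and both sample inputs are illustrative assumptions; ends are exclusive here, unlike the snippet above.

```python
import re

def lookahead_spans(inner, text):
    """Return (start, end) spans for `inner`, including overlapping ones; end is exclusive."""
    # The lookahead consumes no characters, so the scan advances one position at a
    # time and the capture group records what would have matched there.
    probe = re.compile("(?=(" + inner + "))")
    return [m.span(1) for m in probe.finditer(text)]

print(lookahead_spans("ATAT", "GATATATGCATATACTT"))        # [(1, 5), (3, 7), (9, 13)]
print(lookahead_spans(r'"[^"]+"|\'[^\']+\'', '"a" "b"'))  # [(0, 3), (2, 5), (4, 7)]
```

Note the second result: because a match may also start at a closing quote, the text between two quoted strings (here the span (2, 5)) is reported as if it were quoted. For deciding whether an index is inside a string literal, the non-overlapping `re.finditer` behaviour is likely the one you want.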

edited Dec 05 '22 at 21:17

KyleMit

answered Apr 20 '20 at 18:37

Omkar Rahane
