Declaring a stopping position when extracting multiple substrings in Python

Question

I am new to Python and am trying to parse a long text string for substrings between two exact patterns. The problem lies in telling Python to stop at the first occurrence of the end pattern. I also need to collect all instances of the substrings and place them into an array to be used later on. For simplicity, I am trying to use the re module, following an example by Nikolaus Gradwohl. Below is an example of what I have done.

import re
string='valuepattern1":"capture",abcdpattern1":"capture2",defg'
result = re.search('pattern1":"(.*)",', string)
print(result.group(1))

Output: capture",abcdpattern1":"capture2

Here I am trying to collect every instance of capture (capture and capture2) found in the string between the set beginning point of (pattern1":") and the immediate ending point (",) after capture. Each instance collected needs to be added to an array, as shown below.

print(result)
Output: [capture,capture2]

Note that capture does not have a set length and varies throughout the string; however, the beginning and ending patterns remain consistent throughout the string.

Thank you in advance for any help on this matter.

Is this input supposed to be JSON or part of a larger JSON string? Python has a JSON parser. Even if it's not JSON, it looks like the kind of thing where regex parsing could be quite fragile. – user2357112, Apr 13 '16 at 21:01
Thank you for the information @user2357112. The string is a compacted JSON file. I've been trying to convert old bash scripts, written by a previous student, to Python. Most of their script consists of regex commands, which I am not partial to and which seem impractical in this case. – Marc Morgan, Apr 13 '16 at 22:24

score 2 · Accepted Answer · answered Apr 13 '16 at 21:01

You need to change the pattern so that the . in the capturing group doesn't match the closing quotation mark. I can see two reasonable ways to do it:

First, you could use a non-greedy wildcard: pattern1":"(.*?)". The *? tells it to match the smallest possible number of characters, rather than the largest possible number.

The second option is to use a character class to exclude quotation marks from the captured part of the pattern: pattern1":"([^"]*)". Using a ^ as the first character in the brackets negates the class, so it matches any character that is not listed; [^"] is therefore any non-quotation-mark character.
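
As a minimal sketch, here are both options applied to the sample string from the question. Note that re.findall is my addition here, not something named above; it seems to fit because, with a single capturing group, it returns every captured substring as a list, which is what the question asks for. The variable names are only illustrative.

```python
import re

# Sample string taken from the question.
text = 'valuepattern1":"capture",abcdpattern1":"capture2",defg'

# Option 1: non-greedy wildcard, stops at the first closing quotation mark.
lazy = re.findall(r'pattern1":"(.*?)"', text)

# Option 2: negated character class, can never run past a quotation mark.
negated = re.findall(r'pattern1":"([^"]*)"', text)

print(lazy)     # ['capture', 'capture2']
print(negated)  # ['capture', 'capture2']
```

Both patterns give the same result on this input. The character-class version is usually the more predictable of the two, since it cannot cross a quotation mark even when the rest of the pattern fails to match.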

Blckknght


If you haven't done a lot of regex work before, Python's implementation isn't the kindest. I've found [Pythex](http://pythex.org/) to be incredibly helpful when I haven't done one in a while and if you test the solution by @Blckknght you'll see just what changes when you use a non-greedy regex. – Sam Apr 13 '16 at 21:14
Thank you both for the information and assistance on the matter. The Pythex site will be very useful for future practice. – Marc Morgan Apr 13 '16 at 22:28
