I am new to Python and am trying to parse a long text string for substrings between two exact patterns. The problem lies in telling Python to stop at the first occurrence of the end pattern. I also need to collect all instances of the substrings and place them into an array to be used later on. I am trying to use the (re) module example here by Nikolaus Gradwohl for simplicity. Below is an example of what I have done.

import re
string='valuepattern1":"capture",abcdpattern1":"capture2",defg'
result = re.search('pattern1":"(.*)",', string)
print(result.group(1))

Output: capture",abcdpattern1":"capture2"

Here I am trying to collect every instance of capture (capture and capture2) found in the string between the set beginning point of (pattern1":") and the immediate ending point (",) after capture. Each instance collected needs to be added to an array, as shown below.

print result
Output: [capture,capture2]

Note that capture does not have a fixed length and varies throughout the string; however, the beginning and ending patterns remain consistent throughout the string.

Thank you in advance for any help on this matter.

  • Is this input supposed to be JSON or part of a larger JSON string? Python has a JSON parser. Even if it's not JSON, it looks like the kind of thing where regex parsing could be quite fragile. – user2357112 Apr 13 '16 at 21:01
  • Thank you for the information @user2357112 . The string is a compacted JSON file. I've been trying to convert old scripts done in bash, by a previous student, to that of python. A majority of their script contains regex commands which I am not partial to, nor of which I like as it is impractical in this case. – Marc Morgan Apr 13 '16 at 22:24

1 Answer

You need to change the pattern so that the . in the capturing group doesn't match the closing quotation mark. I can see two reasonable ways to do it:

First, you could use a non-greedy wildcard: pattern1":"(.*?)". The *? tells it to match the smallest possible number of characters, rather than the largest possible number.
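As a minimal sketch of the non-greedy version, here it is applied to the sample string from the question. Note that `re.findall` is swapped in for the question's `re.search`, on the assumption that every match is wanted as a list; the variable names are only illustrative.

```python
import re

# Sample string taken from the question.
text = 'valuepattern1":"capture",abcdpattern1":"capture2",defg'

# (.*?) is non-greedy: it stops at the first closing quotation mark.
# re.findall returns every captured group as a list, whereas re.search
# only reports the first match.
lazy_matches = re.findall(r'pattern1":"(.*?)"', text)
print(lazy_matches)  # ['capture', 'capture2']
```

With a single capturing group in the pattern, `re.findall` returns the group contents directly, so no call to `.group(1)` is needed.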

The second option is to use a character class to exclude quotation marks from the captured part of the pattern: pattern1":"([^"]*)". Using a ^ as the first character in the brackets negates the class, so [^"] matches any character that is not a quotation mark.
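The character-class variant might look like the sketch below, again on the question's sample string. The greedy pattern from the question is included for comparison; `re.compile` and the variable names are choices made for this illustration, not something the question requires.

```python
import re

# Sample string taken from the question.
text = 'valuepattern1":"capture",abcdpattern1":"capture2",defg'

# [^"]* can never run past a quotation mark, so each match ends at the
# first closing quote after the start pattern.
class_pattern = re.compile(r'pattern1":"([^"]*)"')
class_matches = class_pattern.findall(text)
print(class_matches)  # ['capture', 'capture2']

# For comparison: the greedy (.*) from the question runs to the LAST '",'.
greedy = re.search(r'pattern1":"(.*)",', text).group(1)
print(greedy)  # capture",abcdpattern1":"capture2
```

One difference from the non-greedy form: `[^"]` also matches newline characters, while `.` does not unless the `re.DOTALL` flag is set.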

Blckknght
  • If you haven't done a lot of regex work before, Python's implementation isn't the kindest. I've found [Pythex](http://pythex.org/) to be incredibly helpful when I haven't done one in a while and if you test the solution by @Blckknght you'll see just what changes when you use a non-greedy regex. – Sam Apr 13 '16 at 21:14
  • Thank you both for the information and assistance on the matter. The Pythex site will be very useful for future practice. – Marc Morgan Apr 13 '16 at 22:28