2

I want to verify and then parse this string (in quotes):

string = "start: c12354, c3456, 34526; other stuff that I don't care about"
# Note that some codes begin with 'c'

I would like to verify that the string starts with 'start:' and ends with ';'. Afterward, I would like a regex to parse out the codes. I tried the following Python re code:

import re

regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
if matched:
    print('matched.groups()', matched.groups())

I have tried different variations, but I can only get either the first or the last code, not a list of all three.

Or should I abandon using a regex?

EDIT: updated to reflect part of the problem space I neglected and fixed string difference. Thanks for all the suggestions - in such a short time.

Lars Nordin
  • Indent code 4 spaces or use the "{}" button in the post editor. I fixed it for you. BTW, did you mean "V1 OIDs" or "start"? – Jim Garrison Jan 10 '11 at 21:48

4 Answers

5

In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
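
A minimal sketch of that behaviour, assuming a variant of the question's pattern that also tolerates the spaces after each comma (the pattern here is illustrative, not the asker's exact one). The group matches three times, yet only the final repetition survives:

```python
import re

text = "start: c12354, c3456, 34526; other stuff that I don't care about"

# The capturing group is repeated once per code, but each repetition
# overwrites the previous capture, so only the last code is retained.
match = re.search(r"start:\s*(?:(c?[0-9]+),?\s*)+;", text)
print(match.groups())  # ('34526',)
```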

Your easiest solution is to first extract the part between start: and ;, and then use a regular expression that returns all matches rather than a single one, e.g. re.findall('c?[0-9]+', text).
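
One way the two-step approach might look, as a sketch rather than a definitive implementation. The helper name `extract_codes` and the constants `CODE` and `LIST_RE` are hypothetical, and the outer pattern is deliberately stricter than a plain "anything up to the semicolon" so that a malformed list is rejected instead of silently parsed:

```python
import re

# Hypothetical names, chosen for illustration only.
CODE = r"c?[0-9]+"
# Step 1 pattern: "start:", a comma-separated list of codes, then ";".
LIST_RE = re.compile(r"start:\s*(" + CODE + r"(?:\s*,\s*" + CODE + r")*)\s*;")

def extract_codes(text):
    """Return the list of codes, or None if the text is not well formed."""
    m = LIST_RE.match(text)
    if m is None:
        return None
    # Step 2: the section is already validated, so findall just splits it up.
    return re.findall(CODE, m.group(1))

print(extract_codes("start: c12354, c3456, 34526; other stuff that I don't care about"))
```

Anything after the first ';' is ignored because `match` only anchors at the start of the string.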

Konrad Rudolph
  • Looks right to me. You can also check this out. But you can use regex to find the start: and ; and do a two step process. And you might want to check this out. http://stackoverflow.com/questions/1099178/matching-nested-structures-with-regular-expressions-in-python – madmik3 Jan 10 '11 at 22:00
  • Thanks, I wondered about regex groups and repetition for a single regex search() call. I had switched over to using findall() as well but I asked the question here just to see if there was a better way. – Lars Nordin Jan 10 '11 at 22:08
5

You could use the standard string tools, which are pretty much always more readable.

s = "start: c12354, c3456, 34526;"

s.startswith("start:") # True if s starts with this string, otherwise False

s.endswith(";") # True if s ends with this string, otherwise False

s[6:-1].strip().split(', ') # drops "start:" and ";", trims the leading space, then splits on ", " into a list of tokens
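
These pieces could be combined into one small function. The sketch below is an assumption about how that might be done, not part of the original answer: `parse_codes` is a hypothetical name, and it uses `str.partition` so that trailing text after the ';' (as in the question's string) is tolerated.

```python
def parse_codes(s, prefix="start:", terminator=";"):
    """Return the comma-separated tokens between prefix and terminator, or None."""
    if not s.startswith(prefix):
        return None
    # Split at the first terminator; sep is empty if no terminator was found.
    body, sep, _rest = s[len(prefix):].partition(terminator)
    if not sep:
        return None
    return [token.strip() for token in body.split(",")]

print(parse_codes("start: c12354, c3456, 34526; other stuff that I don't care about"))
```

Note that this checks only the framing; it does not verify that each token looks like a code.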

Donald Miner
  • Yeah, I know that I *could* use straight string parsing, but then I would have to write code to verify the string format, whereas with a regex you get that right off the bat. – Lars Nordin Apr 11 '11 at 17:02
2

This can be done (pretty elegantly) with a tool like Pyparsing:

from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string

# A code is an optional leading "c", a run of digits, and an optional trailing comma.
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            # Each group is [prefix, digits, comma]; keep only the digits.
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException as exc:
            # Oh noes: string doesn't match!
            continue

Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.

elo80ka
  • Thanks for pitching in with a pyparsing solution! Some other options to consider: define code as `Word('c'+string.digits, string.digits)`; then parser can just be `'start:' + delimitedList(code)("codes") + ';'`; the list of codes can be accessed through the results name as `codes = result.codes` -- in general I would keep the definition of things like code as clean as possible, and not mess them up with things like optional comma delimiters; instead add the commas at the next higher level of parser composition. But your parser certainly gets the job done - congrats! – PaulMcG Jan 14 '11 at 07:21
  • @Paul: Nice! Didn't know about `delimitedList` before now, and it totally makes sense that `Literal` be optional. Great stuff...thanks! – elo80ka Jan 14 '11 at 09:53
  • Interesting. I will have to look into pyparsing. Thanks for the post. – Lars Nordin Apr 11 '11 at 17:03
0
import re

sstr = re.compile(r'start:([^;]*);')  # captures everything between "start:" and ";"
slst = re.compile(r'(?:c?)(\d+)')     # captures the digits of each code, dropping any leading "c"

mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = sstr.match(mystr)
if match:
    res = slst.findall(match.group(1))

results in

['12354', '3456', '34526']
Hugh Bothwell