0

I've slurped a file into a big string. I wish to parse the string and build up a list of dicts keyed on jobno. Each job will have a variable number of key/value pairs, in no particular order. The only thing I can count on is that a jobno:xxxx pair always marks the beginning of a new job.

python 2.7
import re
bigstr = "jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy"

# Capture only the key and the value so that groups() returns exactly two items.
regexJobA = re.compile(r'(\w+):\s(\w+)\s?', re.DOTALL)
for mo in regexJobA.finditer(bigstr):
  keyy, valu = mo.groups()
  print keyy + ":" + valu

yields

jobno:4859305
jobtype:ASSEMBLY
name:BLUEBALLOON
color:red
jobno:3995433
name:SNEAKYPETE
jobtype:PKG
texture:crunchy

which I could hammer/file/sand/paint to work. But there must be a more elegant regex that would build up the jobs implicitly, something like

regexJobB = re.compile(r'((jobno):\s(\w+)\s?)((*not_jobno*):\s(\w+)\s?)+', re.DOTALL)

would do the trick. But how to negate the (jobno) group? Or use some lookahead/lookbehind/lookaround cleverness to yield

jobno:4859305 jobtype:ASSEMBLY name:BLUEBALLOON color:red
jobno:3995433 jobtype:PKG texture:crunchy

TIA,

code_warrior

code_warrior
  • Possible duplicate of [Regular expressions: Ensuring b doesn't come between a and c](https://stackoverflow.com/questions/37240408/regular-expressions-ensuring-b-doesnt-come-between-a-and-c) – CertainPerformance Jan 07 '19 at 05:13
  • e.g. `jobno:(?:(?!jobno:).)+` https://regex101.com/r/IKSWQN/1 – CertainPerformance Jan 07 '19 at 05:14
  • If you need a regex solution, you may use `regex` PyPi library and [the code like this](https://rextester.com/FJBSKW61207), with `jobno:\s*(?P\w+)(?:\s+(?P\w+):\s*(?P\w+))+?(?=\s+jobno:|$)` (see [demo](https://regex101.com/r/6UgIhY/2/)). Output is `[['color', {'color': 'red', 'name': 'BLUEBALLOON', 'jobtype': 'ASSEMBLY'}], ['texture', {'texture': 'crunchy', 'name': 'SNEAKYPETE', 'jobtype': 'PKG'}]]` – Wiktor Stribiżew Jan 07 '19 at 08:11
  • I see it work on the regex101 website but try as I might – code_warrior Jan 08 '19 at 02:15
  • The regex on the regex101 website gives me exactly what I need (thx!) but refuses to work inside my script. Hm. – code_warrior Jan 08 '19 at 03:05
  • If the last comment is addressed to me, please check [this code](https://rextester.com/FJBSKW61207) where I am using PyPi `regex` module. Are you using the `re` or `regex` module? – Wiktor Stribiżew Feb 04 '19 at 10:09

2 Answers

0

Using re.findall here seems like something of an improvement over what you currently have:

import re
bigstr = "jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy"
result = re.findall(r'\S+\s*:\s*\S+', bigstr)
print(result)

['jobno: 4859305', 'jobtype: ASSEMBLY', 'name: BLUEBALLOON', 'color: red', 'jobno: 3995433',
    'name: SNEAKYPETE', 'jobtype: PKG', 'texture: crunchy']

At least this avoids writing an explicit loop. My answer assumes that the input is a single-line string. If you need to match across lines, the answer would change slightly.
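To go from a flat list of pairs to one dict per job, one option is to start a new dict each time a jobno key appears. This is only a minimal sketch, not part of the original answer: `group_jobs` is a hypothetical helper name, the pattern is changed to capture the key and the value separately, and it assumes that keys are plain word characters and values contain no whitespace.

```python
import re

def group_jobs(text):
    """Return one dict per job; a new job starts at each 'jobno' key."""
    jobs = []
    # Capture key and value separately instead of the whole 'key: value' pair.
    for key, value in re.findall(r'(\w+)\s*:\s*(\S+)', text):
        # Open a new dict on every jobno (or if pairs appear before any jobno).
        if key == 'jobno' or not jobs:
            jobs.append({})
        jobs[-1][key] = value
    return jobs

bigstr = "jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy"
for job in group_jobs(bigstr):
    print(job)
```

Each printed dict holds every pair that belongs to one job, including its jobno.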

Tim Biegeleisen
0

You may use

regexJobB = re.compile(r'jobno:\s*(\d+)\s*(.*?)(?=\s+jobno:|$)', re.DOTALL)

See the regex demo. It will let you get separate jobnos, capture their IDs into Group 1 and the rest of the parameters into Group 2. Then, you may either use a second regex to get those params, or just use splitting.

See a Python demo:

import re
bigstr = "jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy"

regexJobB = re.compile(r'jobno:\s*(\d+)\s*(.*?)(?=\s+jobno:|$)', re.DOTALL)
for job in regexJobB.finditer(bigstr):
  jobno = job.group(1)
  jobparams = dict(re.findall(r'(\w+):\s*(\w+)', job.group(2)))
  print("No.: {}\nOther params: {}".format(jobno, jobparams))

Output:

No.: 4859305
Other params: {'color': 'red', 'name': 'BLUEBALLOON', 'jobtype': 'ASSEMBLY'}
No.: 3995433
Other params: {'texture': 'crunchy', 'name': 'SNEAKYPETE', 'jobtype': 'PKG'}

The regex matches

  • jobno: - a literal string
  • \s* - 0+ whitespaces
  • (\d+) - Group 1: one or more digits
  • \s* - 0+ whitespaces
  • (.*?) - Group 2: any 0 or more chars as few as possible
  • (?=\s+jobno:|$) - a positive lookahead that stops the match right before 1+ whitespaces followed by jobno:, or at the end of the string.
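For the splitting alternative mentioned earlier, a rough sketch could look like the following. It collects the list of dicts the question asks for. `parse_jobs` is a hypothetical helper name, and the code assumes every parameter is written as `key: value` with whitespace after the colon and no whitespace inside keys or values.

```python
import re

regexJobB = re.compile(r'jobno:\s*(\d+)\s*(.*?)(?=\s+jobno:|$)', re.DOTALL)

def parse_jobs(text):
    """Return a list of dicts, one per job, using split() for the params."""
    jobs = []
    for job in regexJobB.finditer(text):
        tokens = job.group(2).split()
        # Even positions hold 'key:' tokens, odd positions hold the values.
        params = {k.rstrip(':'): v for k, v in zip(tokens[0::2], tokens[1::2])}
        params['jobno'] = job.group(1)
        jobs.append(params)
    return jobs

bigstr = "jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy"
for job in parse_jobs(bigstr):
    print(job)
```

If a value can contain spaces, the pairing by position breaks down, and the second `re.findall` from the demo is the safer choice.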
Wiktor Stribiżew