negating a regex group

Question

I've slurped a file into a big string. I want to parse the string and build a list of dicts, one per jobno. Each job has a variable number of key/value pairs, in no particular order. The only thing I can count on is that a jobno: xxxx pair always marks the beginning of a new job.

python 2.7
import re
bigstr = "jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy"

# Capture only the key and the value, so mo.groups() yields exactly two items
regexJobA = re.compile(r'(\w+):\s(\w+)\s?', re.DOTALL)
for mo in regexJobA.finditer(bigstr):
  keyy, valu = mo.groups()
  print(keyy + ":" + valu)

yields

jobno:4859305
jobtype:ASSEMBLY
name:BLUEBALLOON
color:red
jobno:3995433
name:SNEAKYPETE
jobtype:PKG
texture:crunchy

which I could hammer/file/sand/paint to work. But there must be a more elegant regex that would build up the jobs implicitly, something like

regexJobB = re.compile(r'((jobno):\s(\w+)\s?)((*not_jobno*):\s(\w+)\s?)+', re.DOTALL)

would do the trick. But how to negate the (jobno) group? Or use some lookahead/lookbehind/lookaround cleverness to yield

jobno:4859305 jobtype:ASSEMBLY name:BLUEBALLOON color:red
jobno:3995433 name:SNEAKYPETE jobtype:PKG texture:crunchy

TIA,

code_warrior

Possible duplicate of [Regular expressions: Ensuring b doesn't come between a and c](https://stackoverflow.com/questions/37240408/regular-expressions-ensuring-b-doesnt-come-between-a-and-c) — CertainPerformance, Jan 07 '19 at 05:13
If you need a regex solution, you may use `regex` PyPi library and [the code like this](https://rextester.com/FJBSKW61207), with `jobno:\s*(?P\w+)(?:\s+(?P\w+):\s*(?P\w+))+?(?=\s+jobno:|$)` (see [demo](https://regex101.com/r/6UgIhY/2/)). Output is `[['color', {'color': 'red', 'name': 'BLUEBALLOON', 'jobtype': 'ASSEMBLY'}], ['texture', {'texture': 'crunchy', 'name': 'SNEAKYPETE', 'jobtype': 'PKG'}]]` — Wiktor Stribiżew, Jan 07 '19 at 08:11
The regex on the regex101 website gives me exactly what I need (thx!) but refuses to work inside my script. Hm. — code_warrior, Jan 08 '19 at 03:05
If the last comment is addressed to me, please check [this code](https://rextester.com/FJBSKW61207) where I am using PyPi `regex` module. Are you using the `re` or `regex` module? — Wiktor Stribiżew, Feb 04 '19 at 10:09

score 0 · Answer 1 · answered Jan 07 '19 at 05:14

Using re.findall here seems like something of an improvement over what you currently have:

import re
bigstr = "jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy"
result = re.findall(r'\S+\s*:\s*\S+', bigstr)  # raw string keeps the backslashes literal
print(result)

['jobno: 4859305', 'jobtype: ASSEMBLY', 'name: BLUEBALLOON', 'color: red', 'jobno: 3995433',
    'name: SNEAKYPETE', 'jobtype: PKG', 'texture: crunchy']

At least this avoids having to iterate. My answer assumes that you have a single-line input string. If you need to match across lines, my answer would change slightly.
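To get from a flat list of pairs to the list of dicts the question asks for, one option is to start a new dict whenever the key is jobno. This is only a sketch: the helper name `group_jobs` is made up for illustration, and it assumes that keys and values consist of word characters only, as in the sample string.

```python
import re

bigstr = ("jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red "
          "jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy")

def group_jobs(text):
    """Collect key/value pairs into one dict per job (hypothetical helper)."""
    jobs = []
    # Two capture groups make findall return (key, value) tuples
    for key, value in re.findall(r'(\w+)\s*:\s*(\w+)', text):
        # A jobno key opens a new job; so does any pair seen before the first jobno
        if key == 'jobno' or not jobs:
            jobs.append({})
        jobs[-1][key] = value
    return jobs

jobs = group_jobs(bigstr)
print(jobs)
```

This does iterate over the pairs, but it needs only one simple pattern and no lookarounds.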

score 0 · Answer 2 · answered Mar 05 '19 at 14:23

You may use

regexJobB = re.compile(r'jobno:\s*(\d+)\s*(.*?)(?=\s+jobno:|$)', re.DOTALL)

See the regex demo. It will let you get separate jobnos, capture their IDs into Group 1 and the rest of the parameters into Group 2. Then, you may either use a second regex to get those params, or just use splitting.

See a Python demo:

import re
bigstr = "jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy"

regexJobB = re.compile(r'jobno:\s*(\d+)\s*(.*?)(?=\s+jobno:|$)', re.DOTALL)
for job in regexJobB.finditer(bigstr):
  jobno = job.group(1)
  jobparams = dict(re.findall(r'(\w+):\s*(\w+)', job.group(2)))
  print("No.: {}\nOther params: {}".format(jobno, jobparams))

Output:

No.: 4859305
Other params: {'color': 'red', 'name': 'BLUEBALLOON', 'jobtype': 'ASSEMBLY'}
No.: 3995433
Other params: {'texture': 'crunchy', 'name': 'SNEAKYPETE', 'jobtype': 'PKG'}
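
The splitting alternative mentioned earlier could look roughly like the sketch below. The helper name `params_by_splitting` is made up for illustration, and the approach assumes every key is written as `key:` followed by whitespace and that values contain no whitespace, which holds for the sample string but may not hold for real data.

```python
import re

bigstr = ("jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red "
          "jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy")

regexJobB = re.compile(r'jobno:\s*(\d+)\s*(.*?)(?=\s+jobno:|$)', re.DOTALL)

def params_by_splitting(params):
    """Turn 'key: value key: value' into a dict using str.split (hypothetical helper)."""
    tokens = params.split()
    # Even positions hold 'key:' tokens, odd positions hold the values
    keys = [token.rstrip(':') for token in tokens[0::2]]
    return dict(zip(keys, tokens[1::2]))

jobs = []
for job in regexJobB.finditer(bigstr):
    entry = {'jobno': job.group(1)}
    entry.update(params_by_splitting(job.group(2)))
    jobs.append(entry)
print(jobs)
```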

The regex matches

jobno: - a literal string
\s* - 0+ whitespaces
(\d+) - Group 1: one or more digits
\s* - 0+ whitespaces
(.*?) - Group 2: any 0 or more chars as few as possible
(?=\s+jobno:|$) - a positive lookahead that stops the match right before 1+ whitespaces followed by jobno:, or at the end of the string.
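
For the literal question of negating the jobno group, a negative lookahead placed in front of the key is one option: `(?!jobno\b)\w+` matches any key except jobno. The sketch below is an illustrative variant, not the pattern from the demo above; the names `job_re` and `lines` are made up. With the `re` module a repeated capturing group keeps only its last repetition, so the whole run of non-jobno pairs is captured as one group and then split with a second `findall`.

```python
import re

bigstr = ("jobno: 4859305 jobtype: ASSEMBLY name: BLUEBALLOON color: red "
          "jobno: 3995433 name: SNEAKYPETE jobtype: PKG texture: crunchy")

# Group 1: the job number. Group 2: every following "key: value" pair
# whose key is not "jobno" (enforced by the negative lookahead).
job_re = re.compile(r'jobno:\s*(\w+)((?:\s+(?!jobno\b)\w+:\s*\w+)*)')

lines = []
for mo in job_re.finditer(bigstr):
    pairs = [('jobno', mo.group(1))] + re.findall(r'(\w+):\s*(\w+)', mo.group(2))
    lines.append(' '.join(key + ':' + value for key, value in pairs))
print('\n'.join(lines))
```

Each element of `lines` then holds one job in the `key:value key:value` form requested in the question.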
