Python Regex look behind

Question

I have the following text:

<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8 
L 112.176 307.8 
L 174.672 270 
L 241.632 171.72 
L 304.128 58.32 
L 380.016 171.72 
L 442.512 217.08 
L 491.616 141.48 
L 491.616 307.8 
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>

I need to grab this portion out:

d="M 112.176 307.8 
L 112.176 307.8 
L 174.672 270 
L 241.632 171.72 
L 304.128 58.32 
L 380.016 171.72 
L 442.512 217.08 
L 491.616 141.48 
L 491.616 307.8 
z
"

I need to replace this section with something else. I was able to grab the entirety of <clipPath ...><path d="[code i want]"/> but this doesn't help me because I can't override the id in the <clipPath> element.

Note that there are other <clipPath> elements that I do not want to touch. I only want to change <path> elements within <clipPath> elements.

I'm thinking that the answer has to do with selecting everything before a clipPath element and ending at the Path section. Any help would be entirely appreciated.

I've been using http://pythex.org/ for help and have also seen odd behavior (having to do with multiline and spaces) where the same pattern doesn't act the same on that site and in Python 3.x code.

Here are some of the things I've tried:

reg = r'(<clipPath.* id=".*".*>)'
reg = re.compile(r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")')
reg = re.compile(r'((?<!<clipPath).* id=".*".*>\s*<path.*d="(.*\n)+")')

g = reg.search(text)
g

see also http://stackoverflow.com/questions/15857818/python-svg-parser — kennytm, Jan 27 '17 at 20:06
I think, regardless of whether it is nested, you had better use a tool like BeautifulSoup... — Willem Van Onsem, Jan 27 '17 at 20:06
you are using the wrong capture group, and enclosing the whole statement. — Aaron, Jan 27 '17 at 20:07
This: http://stackoverflow.com/questions/15857818/python-svg-parser might work for my needs! I still would like to figure out how to do the regex though, since it got me caught up for so long and it's a good general ability to have. — Dan, Jan 27 '17 at 20:10
is this an `xml` ? why won't you do this with `xml.etree.ElementTree` or `lxml` ? — PYPL, Jan 27 '17 at 20:11
I always use [regex101](https://regex101.com/) it's a huge lifesaver — Aaron, Jan 27 '17 at 20:11

Jean-François Fabre · Answer 1 · 2017-01-27T21:32:07.583

3

regex is never the proper way of parsing xml.

Here's a simple standalone example which does it using lxml:

from lxml import etree

text="""<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 112.176 307.8
L 174.672 270
L 241.632 171.72
L 304.128 58.32
L 380.016 171.72
L 442.512 217.08
L 491.616 141.48
L 491.616 307.8
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>"""

# Wrap the fragment in an arbitrary root node so it parses as a single XML document
root = etree.XML("<X>"+text+"</X>")
p = root.find(".//path")
print(p.get("d"))

result:

M 112.176 307.8 L 112.176 307.8 L 174.672 270 L 241.632 171.72 L 304.128 58.32 L 380.016 171.72 L 442.512 217.08 L 491.616 141.48 L 491.616 307.8 z

first, I create the main node. Since there are several nodes, I wrap it in an arbitrary main node
then I look for "path" anywhere
once found, I get the d attribute

Now I'm changing the text for d and dump it:

p.set("d","[new text]")
print(etree.tostring(root))

now the output is like:

...
<path d="[new text]"/>\n
...

still, quick and dirty, maybe not robust to several path nodes, but works with the snippet you provided (and I'm no xml expert, just fumbling)
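To cover the "several path nodes" case and to leave any <path> outside a <clipPath> alone, the lookup can be narrowed to paths that are direct children of a clipPath. The following is only a sketch, and it swaps in the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra installs. The fragment and its ids are made up for illustration, and it assumes the elements carry no XML namespace (a full SVG file normally declares one, which would have to be included in the lookup).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one <path> inside a <clipPath>, one outside of it.
fragment = '''<clipPath id="a"><path d="M 0 0 z"/></clipPath>
<clipPath id="b"><rect height="1" width="1"/></clipPath>
<path d="M 9 9 z"/>'''

# Same trick as above: wrap the fragment in an arbitrary root node.
root = ET.fromstring("<X>" + fragment + "</X>")

# ".//clipPath/path" selects only <path> elements whose parent is a <clipPath>.
for path in root.findall(".//clipPath/path"):
    path.set("d", "[new text]")

result = ET.tostring(root, encoding="unicode")
print(result)
```

The clipPath ids are never touched, because only the d attribute of the selected elements is rewritten.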

BTW, another hacky/non-regex way of doing it: using multi-character split:

text.split(' d="')[1].split('"/>')[0]

This takes the second part, after the d=" delimiter, then the first part, before the "/> delimiter. It preserves the multi-line formatting.

edited Jan 27 '17 at 21:32

answered Jan 27 '17 at 20:13

Jean-François Fabre


Nice for advising a solution without regexes. +1. – Willem Van Onsem Jan 27 '17 at 20:14
never say never ;) it may be better practice to use `lxml` or similar, but the author also stated he wanted to learn regex – Aaron Jan 27 '17 at 20:14
@Aaron I understand that, then the OP should practice on something else than a nested syntax language with several possible syntaxes, like for instance a text file with line-per-line data. If OP wanted to parse C or Java with regex, it would be foolish as well. – Jean-François Fabre Jan 27 '17 at 20:16
(Plus since I don't know zit about xml and I could find my way by trial and error in a few minutes, that could convince more xml newbies that it's not that hard to use, even if I personally _hate_ xml) – Jean-François Fabre Jan 27 '17 at 20:18
In my personal experience, I sometimes run into poorly formed xml that requires some regex lovin' – Aaron Jan 27 '17 at 20:19
@Aaron: that's different: if the xml is broken, then regexes are useful to fix it. – Jean-François Fabre Jan 27 '17 at 20:20
Thank you all for the help. This was very good discussion and I learned a lot from it. :) – Dan Jan 27 '17 at 20:56

score 2 · Accepted Answer · answered Jan 27 '17 at 20:38

TL;DR: r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'

let's break that down...

you started with: r'(<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+")' which enclosed your entire pattern inside a group, so the whole element would be captured in the match object. Let's take out those parentheses: r'<clipPath.* id=".*".*>\s*<path.*d="(.*\n)+"'

next you seem to use .* quite often, which can be dangerous because it is blind and greedy. for the clipPath id, if you know the id is always alphanumeric, a better solution might be r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d="(.*\n)+"'

finally, let's look at what you actually want to capture. your example shows you want to capture the quotation marks, so let's get those inside our capture group: ...*d=("(.*\n)+"). This leaves us with a weird nested group situation though, so let's make the inner group non-capturing: ...*d=("(?:.*\n)+").

now we're capturing what you want, but we still have a problem... what if there are multiple elements that satisfy these criteria? the greedy matching of the + in ...*d=("(?:.*\n)+") will capture every line in between. What we can do here is make the + non-greedy by following it with a ?: ...*d=("(?:.*\n)+?").
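A small sketch of the greedy versus non-greedy difference, using a made-up two-element input (the A/B/C/D lines are placeholders, not real path data) and only the d=... tail of the pattern:

```python
import re

# Hypothetical input with two multi-line d attributes.
text = '''<path d="A
B
"/>
<path d="C
D
"/>
'''

# Greedy +: consumes as many lines as possible, then backtracks to the LAST
# line boundary that is followed by a quotation mark.
greedy = re.search(r'd=("(?:.*\n)+")', text).group(1)

# Non-greedy +?: stops at the FIRST line boundary followed by a quotation mark.
lazy = re.search(r'd=("(?:.*\n)+?")', text).group(1)

print(repr(greedy))
print(repr(lazy))
```

The greedy version runs from the first attribute through the end of the second one, while the non-greedy version stops at the closing quote of the first attribute.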

put all these things together:

r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")'
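Since the goal was to replace that section, here is one possible way to use the pattern for the replacement, as a sketch rather than the only approach. Python's re module only allows fixed-width look-behinds, so instead of a look-behind it uses the span of capture group 1 and splices the new value in. The shortened path data and the "M 0 0 z" replacement are placeholders.

```python
import re

# Shortened version of the text from the question (placeholder path data).
svg = '''<clipPath id="p54dfe3d8fa">
   <path d="M 112.176 307.8
L 174.672 270
z
"/>
  </clipPath>
  <clipPath id="p27c84a8b3c">
   <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
  </clipPath>'''

pattern = re.compile(r'<clipPath.* id="[a-zA-Z0-9]+".*>\s*<path.*d=("(?:.*\n)+?")')

match = pattern.search(svg)
captured = match.group(1)      # the quoted d value, quotation marks included

# Replace only the captured group; everything before and after it is kept.
start, end = match.span(1)
replaced = svg[:start] + '"M 0 0 z"' + svg[end:]
print(replaced)
```

Because only group 1 is spliced out, the clipPath id and the second clipPath element stay exactly as they were.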

This pretty much directly answers my question using a regex and that's awesome! I believe that one of the SVG/XML parsing libs is what I'm going to ultimately go with but this is going to get marked as correct because of your effort and explanation. Thanks! — Dan, Jan 27 '17 at 20:52

score 1 · Answer 3 · answered Jan 27 '17 at 20:36

An xml based solution that edits the path.

import xml.dom.minidom

# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parseString('<X>' + my_xml + '</X>')
collection = DOMTree.documentElement
for clip_path in collection.getElementsByTagName("clipPath"):
    paths = clip_path.getElementsByTagName('path')
    for path in paths:
        path.setAttribute('d', '[code i want]')

print(DOMTree.toxml())

Data used:

my_xml = """
    <clipPath id="p54dfe3d8fa">
       <path d="M 112.176 307.8
    L 112.176 307.8
    L 174.672 270
    L 241.632 171.72
    L 304.128 58.32
    L 380.016 171.72
    L 442.512 217.08
    L 491.616 141.48
    L 491.616 307.8
    z
    "/>
      </clipPath>
      <clipPath id="p27c84a8b3c">
       <rect height="302.4" width="446.4" x="72.0" y="43.2"/>
      </clipPath>
"""

This is pretty much what I ended up doing. Thank you! – Dan Jan 27 '17 at 20:55
