I'm writing an app to convert data, stored mostly in XML files, to static HTML. At any point in the XML there may be a nested tag like this one:

<t:latex-object url='%28-3%29%5E%7B2%7D%3D3%5E%7B2%7D'><![CDATA[(-3)^{2}=3^{2}]]></t:latex-object>

I have to take the url attribute, generate a LaTeX image from it, and replace the tag with an img tag (with the matching src) in the HTML.

What I'm doing right now is going through the entire XML file and generating the HTML output, leaving these tags as they are. Next, I wanted to go through the entire output, find all occurrences of this tag, generate an image for each one, and replace them. But since the url attribute is different every time, I can't use the replace() function.

I was thinking about using regex, but all I've got so far is a list of all the url attributes and a headache. My idea was to write a regex that replaces every latex tag with just its url attribute, so I could then iterate through my list of urls and replace each one with its generated image.

Does this kind of approach make any sense? I feel like there should be an easier way to do it, not to mention I've spent over an hour trying to write such a regex with poor results.

Zibi

2 Answers

Description

This regex captures the entire tag, along with the value of its url attribute. Note that it will fail if t:latex-object tags are nested inside one another.

<t:latex-object\b(?=\s)(?=(?:(?![^>])'[^']*'|"[^"]*"|.)*\surl='([^']*)').*?<\/t:latex-object>


Python Example

A working example is here: http://repl.it/J0t/1. Note that in the example I'm escaping some of the quotes.

Code

import re

# Sample input containing a single latex-object tag
string = """
<t:latex-object url='%28-3%29%5E%7B2%7D%3D3%5E%7B2%7D'><![CDATA[(-3)^{2}=3^{2}]]></t:latex-object>
"""

# group(0) is the whole tag, group(1) is the value of the url attribute
pattern = r'<t:latex-object\b(?=\s)(?=(?:(?![^>])\'[^\']*\'|"[^"]*"|.)*\surl=\'([^\']*)\').*?<\/t:latex-object>'

for matchObj in re.finditer(pattern, string, re.M | re.I | re.S):
    print("-------")
    print("matchObj.group(0) : ", matchObj.group(0))
    print("matchObj.group(1) : ", matchObj.group(1))

Output

matchObj.group(0) :  <t:latex-object url='%28-3%29%5E%7B2%7D%3D3%5E%7B2%7D'><![CDATA[(-3)^{2}=3^{2}]]></t:latex-object>
matchObj.group(1) :  %28-3%29%5E%7B2%7D%3D3%5E%7B2%7D
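
For the replacement step itself, re.sub accepts a function as the replacement, so each tag can be swapped for an img tag in one pass. The following is only a sketch under some assumptions: it uses a simplified pattern (not the one above) that looks for the url attribute inside the opening tag, and render_latex is a hypothetical stand-in for whatever actually generates the image and returns its path.

```python
import hashlib
import re

# Simplified pattern (an assumption, not the regex from the answer):
# the url attribute is matched inside the opening tag only.
TAG_RE = re.compile(
    r"<t:latex-object\b[^>]*?\surl='([^']*)'[^>]*>.*?</t:latex-object>",
    re.S,
)


def render_latex(url):
    # Hypothetical placeholder: a real version would render the formula
    # and save the image. Here we only derive a stable file name.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return "latex/" + name + ".png"


def replace_latex_tags(html):
    # re.sub calls to_img once per match and inserts its return value.
    def to_img(match):
        return '<img src="{}">'.format(render_latex(match.group(1)))

    return TAG_RE.sub(to_img, html)


sample = (
    "<p>Before "
    "<t:latex-object url='%28-3%29%5E%7B2%7D%3D3%5E%7B2%7D'>"
    "<![CDATA[(-3)^{2}=3^{2}]]></t:latex-object>"
    " after</p>"
)

print(replace_latex_tags(sample))

# Replacing each tag with just its url is a one-liner with a backreference.
print(TAG_RE.sub(r"\1", sample))
```

The simplified pattern assumes single-quoted attribute values that contain no > character, and, like the regex above, it does not handle nested latex-object tags.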
Ro Yo Mi
  • Awesome, thanks a bunch. One question though: is it possible to write a regex replace with this pattern that would replace all tags with their url? I know I can easily use replace() now, but I was wondering if it's possible to do it in one line. – Zibi Jun 24 '13 at 17:12
  • I think that may be possible. Based on that input text, what would be the desired output string? – Ro Yo Mi Jun 24 '13 at 17:47
  • Well, I want to replace <![CDATA[(-3)^{2}=3^{2}]]> with %28-3%29%5E%7B2%7D%3D3%5E%7B2%7D – Zibi Jun 25 '13 at 08:53
  • There are probably better ways to do this (I'm not a python programmer). Try: http://repl.it/J0t/3 – Ro Yo Mi Jun 25 '13 at 13:41
0

Regex is never a good idea for parsing XML. It seems to me you should either use a proper XML parser in your Python script or use XSLT.
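
As a rough illustration of the parser route, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes the t prefix is bound to a namespace in the source XML (the URI below is made up), and image_path_for is a hypothetical placeholder for the real image generation step.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the real one is whatever the source XML
# declares for the "t" prefix.
T_NS = "http://example.com/t"

source = (
    "<doc xmlns:t='http://example.com/t'>"
    "<p>Before "
    "<t:latex-object url='%28-3%29%5E%7B2%7D%3D3%5E%7B2%7D'>"
    "<![CDATA[(-3)^{2}=3^{2}]]></t:latex-object>"
    " after</p>"
    "</doc>"
)


def image_path_for(url):
    # Hypothetical placeholder for generating the image and returning its path.
    return "latex/" + url + ".png"


root = ET.fromstring(source)

# Collect matches first so the tree is not modified while it is being walked.
for elem in list(root.iter("{%s}latex-object" % T_NS)):
    url = elem.get("url")
    # Turn the element into an <img> in place. The tail text (" after")
    # is left untouched, so the surrounding content is preserved.
    elem.tag = "img"
    elem.attrib.clear()
    elem.set("src", image_path_for(url))
    elem.text = None

print(ET.tostring(root, encoding="unicode"))
```

This works on the XML before it is turned into HTML, so nesting and attribute quoting are handled by the parser rather than by a pattern.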

rectummelancolique