2

I've got a string (Python 2.7.3) which is rendered as a template in Django, but I don't think this is specific to Django. The string comes from the document.xml file inside a docx file. I'm extracting the document XML, rendering it, and putting it back inside the docx for some simple mail-merge type stuff.
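
For context, the round trip described above could look roughly like the following sketch. The names `rewrite_document_xml` and `transform` are made up for illustration, and `word/document.xml` is assumed to be where the main document part lives, which is its usual location in a docx archive.

```python
import io
import zipfile


def rewrite_document_xml(docx_bytes, transform):
    """Return new .docx bytes with word/document.xml passed through transform.

    transform is any callable taking the XML as text and returning text
    (hypothetical; in practice it would wrap the template rendering step).
    """
    source = zipfile.ZipFile(io.BytesIO(docx_bytes))
    out_buffer = io.BytesIO()
    target = zipfile.ZipFile(out_buffer, "w", zipfile.ZIP_DEFLATED)
    for item in source.infolist():
        data = source.read(item.filename)
        if item.filename == "word/document.xml":
            # Only the main document part is rewritten; every other
            # archive member is copied across unchanged.
            data = transform(data.decode("utf-8")).encode("utf-8")
        target.writestr(item, data)
    target.close()
    source.close()
    return out_buffer.getvalue()
```

Any cleanup of the template tags would happen inside `transform`, before the text is rendered.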

One of the issues, other than the obvious limitations to what template tags I can use, is that Word likes to drop in a whole bunch of xml if you edit the text in Word.

For my needs, I'd be successful if I could

  1. find all occurrences of &quot; between double curly braces and replace each with a literal double quote (").

I'd like to replace the &quot; with " in something like the following:

word_docxml = 'some text here {{form.letterdate|date:&quot;Y-m-d&quot;}} and more text'

I was reading over these:

but having trouble putting it together.

  2. How do I strip every tag (the < > and everything inside it) that falls between the {{ and }} in a mess like the following:

    <w:rPr>
      <w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/>
      <w:color w:val="00000A"/>
      <w:sz w:val="22"/>
      <w:szCs w:val="22"/>
      <w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/>
    </w:rPr>
    <w:t>{{form.</w:t></w:r><w:r>
    <w:rPr>
      <w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/>
      <w:b w:val="false"/>
      <w:bCs w:val="false"/>
      <w:color w:val="00000A"/>
      <w:sz w:val="22"/>
      <w:szCs w:val="22"/>
      <w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/>
    </w:rPr>
    <w:t>L</w:t></w:r><w:r>
    <w:rPr>
      <w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/>
      <w:color w:val="00000A"/>
      <w:sz w:val="22"/>
      <w:szCs w:val="22"/>
      <w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/>
    </w:rPr>
    <w:t>etterDate.value|date:"Y-m-d"}}</w:t></w:r>
    

which would result in the following (apologies, I can't seem to highlight the area of interest):

<w:rPr>
  <w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/>
  <w:color w:val="00000A"/>
  <w:sz w:val="22"/>
  <w:szCs w:val="22"/>
  <w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/>
</w:rPr>
<w:t>{{form.LetterDate.value|date:"Y-m-d"}}</w:t></w:r>

How does one handle this? Is regex the way to go? If so, how do I put the command together?

This is not a duplicate of Between double curly braces: replace particular text because it has no mention of handling a double curly brace for start and end for the search range (that was my real problem, I've read through many examples and was unable to get the pattern for substitution formatted correctly). The other post is about parsing a subset of html entities in XHTML; there is no XHTML parsing required, mentioned or questioned in my post. This post here asks how to remove and/or replace a repeating pattern between two other known start/end patterns. I provided a brief background, two concrete examples from the simple to the complex hoping to learn how to accomplish my current task - my best hope was to get part A explained and apply the method myself to part B. I got intelligent discussion and super replies from helpful members of the community. My post doesn't involve HTML at all as the template I'm rendering in Django is added back to a docx archive and saved to a filestore. It is not a duplicate (of the marked duplicate anyhow).

AMG
  • `re.sub('\&quot', '\"', s)` – Cory Kramer Aug 16 '15 at 19:59
  • 2
    The question is a duplicate of what? The question is to replace something specific in between curly braces, and has nothing to do with HTML other than also being part of a template language. Why would it not be appropriate to answer this with a solution if the author has 1000 of these files and needs to sort it out? Just tossing that post around, however funny it is, doesn't make it right! Show me any part of this post that even mentions HTML besides comments to that. – melwil Aug 16 '15 at 22:49
  • @melwil I agree with melwil. We have a saying that you don't need an elephant gun to shoot a mosquito. Solve a problem with the simplest tool you have in your toolbox. An HTML parser is not too difficult to use, but I am _sure_ there is some yak shaving necessary _compared_ to just firing up a regex that solves the problem adequately and is available in any language out of the box. Don't use the elephant gun for this simple problem, that is, if you have already mastered regexes or want to improve, of course. – buckley Aug 17 '15 at 09:21
  • 1
    "They" removed the duplicate status? A win for Stack Overflow! Bravo :) – buckley Aug 17 '15 at 09:24
  • Even after my flag was declared "not helpful"! I had given up, but I guess someone finally saw reason. There is no way to fix this issue with an HTML parser anyway, it's not HTML. :p – melwil Aug 17 '15 at 10:48

2 Answers

1

Yes, regex is great for this!

a) Use this:

 re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("&quot;", '"', m.group(1)), word_docxml)

Results:

>>> word_docxml = 'some text here {{form.letterdate|date:&quot;Y-m-d&quot;}} and &quot; more text'
>>> re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("&quot;", '"', m.group(1)), word_docxml)
'some text here {{form.letterdate|date:"Y-m-d"}} and &quot; more text'

b) More of the same, just replacing different content inside the braces:

re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("<[^>]+>", "", m.group(1)), s)

Results:

>>> s = """<w:rPr><w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/><w:color w:val="00000A"/><w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/></w:rPr><w:t>{{form.</w:t></w:r><w:r><w:rPr><w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/><w:b w:val="false"/><w:bCs w:val="false"/><w:color w:val="00000A"/><w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/></w:rPr><w:t>L</w:t></w:r><w:r><w:rPr><w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/><w:color w:val="00000A"/><w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/></w:rPr><w:t>etterDate.value|date:"Y-m-d"}}</w:t></w:r>"""
>>> re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("<[^>]+>", "", m.group(1)), s)
'<w:rPr><w:rFonts w:eastAsia="Times New Roman" w:cs="Arial" w:ascii="Arial" w:hAnsi="Arial"/><w:color w:val="00000A"/><w:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="en-US" w:eastAsia="en-US" w:bidi="ar-SA"/></w:rPr><w:t>{{form.LetterDate.value|date:"Y-m-d"}}</w:t></w:r>'

Explanation, since you asked for guidance and not just the answer:

re.sub(r"(\{\{[^}]+}\})", lambda m: re.sub("&quot;", '"', m.group(1)), word_docxml)

This works by first matching each double-brace span. The lambda expression then takes the group found in that match and performs the replacement on just that text, so anything outside the braces is left untouched.
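
If the lambda is hard to read, the same substitution can be written with a named function. This is only a sketch; `BRACES` and `unescape_quotes` are names made up for illustration.

```python
import re

# Matches one {{ ... }} span; [^}]+ stops at the first closing brace.
BRACES = re.compile(r"(\{\{[^}]+}\})")


def unescape_quotes(match):
    """Return the matched {{ ... }} span with &quot; turned into a literal quote."""
    return match.group(1).replace("&quot;", '"')


word_docxml = 'some text here {{form.letterdate|date:&quot;Y-m-d&quot;}} and &quot; more text'

# re.sub calls unescape_quotes once per match and splices in its return value.
result = BRACES.sub(unescape_quotes, word_docxml)
print(result)
```

The `&quot;` outside the braces is never passed to the function, so it stays as it is.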

The smaller regexes explained:

&quot;     # Just matching that, nothing fancy

A pattern to match tags:

<     # Opening angle bracket of a tag
[^>]+ # Followed by 1 or more characters that are not a closing angle bracket
>     # Followed by the closing angle bracket
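
Both replacements can be combined into one pass over each double-brace span. The following is a sketch rather than a drop-in solution: `clean_span` and `clean_template_spans` are hypothetical names, and the sample string is a shortened stand-in for real Word output.

```python
import re

BRACES = re.compile(r"\{\{[^}]+}\}")  # one {{ ... }} span
TAG = re.compile(r"<[^>]+>")          # one XML tag


def clean_span(match):
    span = TAG.sub("", match.group(0))  # drop any tags Word injected
    return span.replace("&quot;", '"')  # then restore literal quotes


def clean_template_spans(xml):
    """Clean every {{ ... }} span and leave the rest of the XML alone."""
    return BRACES.sub(clean_span, xml)


sample = ('<w:t>{{form.</w:t></w:r><w:r><w:t>L</w:t></w:r><w:r>'
          '<w:t>etterDate.value|date:&quot;Y-m-d&quot;}}</w:t> &quot;')
cleaned = clean_template_spans(sample)
print(cleaned)
```

Only the tags are removed, so any whitespace or line breaks between them would remain inside the braces. This sketch therefore assumes the XML has not been pretty-printed.
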
melwil
  • I think your first regex allows for some false positives. I think I addressed this in my answer. As with every regex solution, the degree of completeness (complexity) can be lowered if one knows what can or cannot occur as input. – buckley Aug 16 '15 at 21:54
  • 1
    Given the fact that these are django templates he's working with, there won't be any single braces inside the double ones. I agree with you on the point that it would produce false positives in some cases you presented, though. – melwil Aug 16 '15 at 21:59
  • Right, the constraints introduced by django makes the false positives illegal I suppose – buckley Aug 16 '15 at 22:02
  • Even so, you make a valid point, and there are single braces employed by django to run statements in the format `{% url "parameter" %}`, for instance. I've updated my answer to match in two turns, like you suggested. – melwil Aug 16 '15 at 22:07
  • Your first solution that used the shortcut to just ignore the opening tags simplified it nicely though. The lambda feature is the one that I was hinting at in my solution (.net has "inline functions" too, very nice). I am finding that multiple passes using lambdas makes regex solutions more readable/maintainable which is _always_ better. The alternative of embedding regexes introduce even more meta characters to counter things that these sub regexes match too much. Trying to do everything with one regex is not necessary and discourages regex newcomers or maintainers. – buckley Aug 16 '15 at 22:16
  • Yes, you are absolutely right about readability. My first was a quick measure to just fix his problem, and I'm pretty sure there would be few problems with it. While inline functions are awesome, many have trouble understanding how they work. This means that ironically, while looking a lot cleaner, this second solution may be harder to wrap one's head around. – melwil Aug 16 '15 at 22:20
  • 2
    In the end, while lambdas take some time to get used to, I am convinced that the student's energy is best put into them instead of crafting or staring at a 20+ character regex that one cannot reason about easily and that allows false positives or negatives to pop up in places or at times where they are unexpected. It's a technique that the student can also leverage in lots more places. I would say bite the bullet :) Lambdas will make one a better programmer as well as a better ambassador for regexes. – buckley Aug 16 '15 at 22:27
  • Of course there are false positives. [You can't use regex to parse html](http://stackoverflow.com/a/1732454/282912) including html templates. – msw Aug 16 '15 at 22:29
  • 2
    @msw This has nothing to do with parsing HTML. At all. Just linking that makes no sense here, and hints at the fact that you may not have understood what "parsing" actually means in that post's context. – melwil Aug 16 '15 at 22:29
  • 2
    @msw Just to be clear, that post explains why parsing HTML is a bad idea, because there is no rigid structure in HTML, browsers are equipped to handle and ignore many mistakes in HTML that is often written by humans, which makes HTML a fragile structure. This question was about handling django template code, which does have _very_ rigid structure and produces errors if done wrong. It's also not wrong to use regex to fetch smaller parts of HTML code, as long as you don't try to read the DOM structure in its entirety. – melwil Aug 16 '15 at 22:38
  • @melwil wow. sometimes a fella is looking for pointers and gets a wonderfully comprehensive reply and great discussion between people obviously in the 'know'. thank you to you too buckley! I'm digging through it now. Why does the first closing curly brace in (\{\{[^}]+}\}) not require a preceding backslash? – AMG Aug 16 '15 at 23:13
  • 1
    `}` is a special character in regex, and would need to be escaped when used to match literal characters. The reason why there is no need for the first one is that it is inside a character class, and most special characters lose or have different meaning there. In the case of most, they simply match the character they really represent. – melwil Aug 16 '15 at 23:15
  • 1
    @AMG Since this question was reopened against all odds, I'd appreciate it if you marked off an accepted answer here. I don't think I've ever spent so much time on any single question! :p – melwil Aug 17 '15 at 10:50
  • @melwil sorry, meant to mark it as solved - sideline project which I had to walk away from for a bit, but the solution worked perfectly. – AMG Aug 18 '15 at 01:25
0

One must be careful when testing a regex that it doesn't match too much (false positives). Given your complex input this becomes more important.

For example, a regex should not match

&quot;

below

test { &quot; }}text
test  &quot; }}

As for your second question, I would do it in 2 passes to keep the regex nice 'n simple.

First use this regex to match content between {{ and }}:

\{\{(.*?)\}\}

Now apply a function to only the contents of group 1. I am familiar with .NET, which allows this, and I hope your language does too.

The function to apply is again a regex replacement, which replaces each match of the following pattern with nothing:

<[^>]*>

I hope I got the Python dialect right.

The first question can use the same idea.
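
In Python, the two-pass idea might look like this sketch. The function names are made up for illustration, and `re.DOTALL` is an added assumption so that a span may contain line breaks.

```python
import re

# Non-greedy: group 1 is everything up to the first }} after a {{.
SPAN = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)
TAG = re.compile(r"<[^>]*>")


def strip_tags_in_braces(text):
    """Second question: remove tags, but only inside {{ ... }}."""
    return SPAN.sub(lambda m: "{{" + TAG.sub("", m.group(1)) + "}}", text)


def unescape_quotes_in_braces(text):
    """First question: turn &quot; into a literal quote, only inside {{ ... }}."""
    return SPAN.sub(lambda m: "{{" + m.group(1).replace("&quot;", '"') + "}}", text)
```

Because the pattern requires a literal `{{` before it will match, both `test { &quot; }}text` and `test  &quot; }}` come back unchanged. A stray `{{` with no closing partner would still pair up with the next `}}` found, so this relies on the template braces being balanced.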

buckley