Replacing multiple strings with regex in python for a file giving truncated string

Question

The following python code

import xml.etree.ElementTree as ET
import time
import fileinput
import re

ts = str(int(time.time()))
modifiedline =''
for line in fileinput.input("singleoutbound.xml"):
    line = re.sub('OrderName=".*"','OrderName="'+ts+'"', line)
    line = re.sub('OrderNo=".*"','OrderNo="'+ts+'"', line)

    line = re.sub('ShipmentNo=".*"','ShipmentNo="'+ts+'"', line)

    line = re.sub('TrackingNo=".*"','TrackingNo="'+ts+'"', line)

    line = re.sub('WaveKey=".*"','WaveKey="'+ts+'"', line)
    modifiedline=modifiedline+line

Returns the modifiedline string with some lines truncated wherever the first match is found

How do I ensure it returns the complete string for each line?

Edit:

I have changed the way I am solving this problem, inspired by Tomalak's answer

import xml.etree.ElementTree as ET
import time

ts = str(int(time.time()))

doc = ET.parse('singleoutbound.xml')

for elem in doc.iterfind('.//*'):
    if 'OrderName' in elem.attrib:
        elem.attrib['OrderName'] = ts
    if 'OrderNo' in elem.attrib:
        elem.attrib['OrderNo'] = ts
    if 'ShipmentNo' in elem.attrib:
        elem.attrib['ShipmentNo'] = ts
    if 'TrackingNo' in elem.attrib:
        elem.attrib['TrackingNo'] = ts
    if 'WaveKey' in elem.attrib:
        elem.attrib['WaveKey'] = ts


doc.write('singleoutbound_2.xml')

You are using regular expressions to replace parts of XML? Scrap your code, start over. Modifications on *ML should be done with a proper tool, in your case with a DOM API (or with XSLT). The ElementTree module you import is a proper tool, but I don't see you using it anywhere. — Tomalak, Aug 24 '16 at 15:04
It's not clear what your expected behavior actually is. Can you provide a sample singleoutbound.xml with your question, the actual output that your code generates, and the desired output that you want your code to produce? Also, I'll point out that your code as written doesn't return *anything*. You construct modifiedline, but do not output it, store it or return it. — Matthew Cole, Aug 24 '16 at 15:15
This `'ShipmentNo="'+ts+'"'` looks like runtime replacement string. I think the replacement string expects a compile time string. Does this work with no exceptions? — , Aug 24 '16 at 15:17
@sln: I generated a simple XML file with one match for each of the five regexs given, and it consumed it with no exceptions thrown. — Matthew Cole, Aug 24 '16 at 15:19
Disregard question, apparently it can be used in runtime but I don't know how they could do that without an internal _eval_ of the code. http://www.dotnetperls.com/sub-python — , Aug 24 '16 at 15:21
Anyway, the regex is peculiar. Anything `.*` will consume to the end of string, then start backtracking to satisfy surrounding expressions. So `".*"` will consume `"this """""""""""""""""" is a """"""" quote"` — , Aug 24 '16 at 15:25
@Tomalak Thank you, I will try doing it with ElementTree, but could you explain why this isn't suggested or proper? — Praveer N, Aug 25 '16 at 06:06
@MatthewCole Apologies, I did not write the complete code in the question. I am using modifiedline string to make a post request later on in the code. — Praveer N, Aug 25 '16 at 06:10
@Praveer Regex is technically unable to deal with XML. This is a hard limitation, as in **it is actually impossible to do it correctly with regular expressions** (your question here and countless others like it are living proof). If you care for code that does not break at run-time over valid (!) input, you should stop trying to use regular expressions for XML now. See http://stackoverflow.com/questions/701166/, among thousands upon thousands of other posts on the topic all over the Internet. This topic has been discussed to death. — Tomalak, Aug 25 '16 at 06:24
@Tomalak oh, sorry. I should have tried googling it first! I am a beginner (this was my first attempt at coding in python as well as regex) and thought this would be more efficient. I will try solving this with cElementTree instead. — Praveer N, Aug 25 '16 at 06:31

Tomalak · Accepted Answer · 2016-08-25T07:07:50.093

1

Here is how to use ElementTree to make modifications to an XML file without accidentally breaking it:

import xml.etree.ElementTree as ET
import time

ts = str(int(time.time()))

doc = ET.parse('singleoutbound.xml')

# ElementTree expects a relative path; a leading '//' only works with a FutureWarning
for elem in doc.iterfind('.//*[@OrderName]'):
    elem.attrib['OrderName'] = ts

# and so on

doc.write('singleoutbound_2.xml')

Things to understand:

XML represents a tree-shaped data structure that consists of elements, attributes and values, among other things. Treating it as line-based plain text fails to recognize this fact.
There is a language to select items from that tree of data, called XPath. It's powerful and not difficult to learn. Learn it. I've used .//*[@OrderName] above to find all elements that have an OrderName attribute.
Trying to modify the document tree with improper tools like string replace and regular expressions will lead to more complex and hard-to-maintain code. You will encounter run-time errors for completely valid input that your regex has no special case for, character encoding issues and silent errors that are only caught when someone looks at your program's output. In other words: It's the wrong thing to do, so don't do it.
The above code is actually simpler and much easier to reason about and extend than your code.
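
As a minimal sketch of how the same approach extends to all five attributes from the question: the sample document, the `ATTRS` tuple and the `stamp` helper below are made up for illustration and are not part of the original answer. It walks the tree with `iter()`, which also visits the root element, and uses `xml.etree.ElementTree` because `cElementTree` was removed in Python 3.9.

```python
import time
import xml.etree.ElementTree as ET

# Hypothetical stand-in for singleoutbound.xml (the real file is not shown in the question).
SAMPLE = (
    '<Shipments>'
    '<Shipment ShipmentNo="S1" TrackingNo="T1">'
    '<Order OrderName="A" OrderNo="1" Note="keep"/>'
    '</Shipment>'
    '</Shipments>'
)

# Attribute names taken from the question's regexes.
ATTRS = ("OrderName", "OrderNo", "ShipmentNo", "TrackingNo", "WaveKey")


def stamp(root, value, names=ATTRS):
    """Set every attribute listed in `names` to `value`, wherever it occurs."""
    for elem in root.iter():          # root itself plus all descendants
        for name in names:
            if name in elem.attrib:   # only touch attributes that already exist
                elem.set(name, value)
    return root


ts = str(int(time.time()))
root = stamp(ET.fromstring(SAMPLE), ts)
result = ET.tostring(root, encoding="unicode")
print(result)
```

For a file on disk, `ET.parse('singleoutbound.xml').getroot()` would take the place of `ET.fromstring(SAMPLE)`, and `ElementTree.write()` the place of `ET.tostring()`.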

edited Aug 25 '16 at 07:07

answered Aug 25 '16 at 06:51

Tomalak

Thank you Tomalak, I have edited the question text with the new code inspired by your answer, which I am using to solve my problem! – Praveer N Aug 25 '16 at 09:45
@PraveerN That code is looking very good. That's the way to go. You can also make a `for attribName in ['OrderName', 'OrderNo', 'etc']` loop instead of copying the lines. – Tomalak Aug 25 '16 at 09:48
Parsing and writing a document with etree may change the document a bit - although there should not be any semantic changes, there may be other changes as etree does a bit of normalization, e.g. on CDATA – janbrohl Aug 25 '16 at 12:15
@jan That's right. There are multiple ways of representing an in-memory tree as serialized XML. That's the core point of my argument. Don't rely on the textual representation, it's ephemeral. Think of it as a transport container between parsers. Don't write tools that rely on the textual representation. A parser will give you the *actual* data, whether it was in a CDATA or not, escaped or not, broken over multiple lines or not, etc. – Tomalak Aug 25 '16 at 13:01
There are parsers that handle CDATA [different](https://en.wikipedia.org/wiki/CDATA#Uses_of_CDATA_sections) from non-CDATA and possibly there are other quirks - so while it is clear that running stuff through etree will not break other well-implemented applications parsing the data, it might bring problems for apps with problematic use of correct parsers (very improbable). – janbrohl Aug 25 '16 at 13:16
*"There are parsers that handle CDATA different from non-CDATA"* - which ones? *"and possibly there are other quirks"* - Sorry, that's FUD. These considerations are not to be made in advance, but when you actually hit a limitation of that kind, which is so improbable that saying "impossible" is close enough. – Tomalak Aug 25 '16 at 14:20

janbrohl · Answer 2 · 2016-08-25T13:24:10.593

0

Do not use Regexes for parsing XML if you don't have an important reason for doing so

* matches greedily, but what you actually seem to want is *?, which stops at the next " instead of running on to the last " in the line.

So just replace each * with *? in your code and you should be fine (apart from the usual do-not-regex-XML-problems).
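
To make the difference concrete, here is a small sketch on a made-up line (the element and attribute values are invented for illustration, and `ts` is a fixed stand-in for the timestamp). It compares the greedy pattern from the question with the lazy `*?` variant and with a negated character class, which is another common way to stop at the next quote.

```python
import re

# Hypothetical input line with several attributes, as in the question's XML.
line = '<Order OrderName="A1" OrderNo="B2" Status="Open"/>'
ts = "1472050000"  # fixed stand-in for str(int(time.time()))

# Greedy: .* runs to the LAST double quote on the line, swallowing the other attributes.
greedy = re.sub('OrderName=".*"', 'OrderName="' + ts + '"', line)

# Lazy: .*? stops at the NEXT double quote.
lazy = re.sub('OrderName=".*?"', 'OrderName="' + ts + '"', line)

# Negated class: [^"]* cannot cross a double quote at all.
negated = re.sub('OrderName="[^"]*"', 'OrderName="' + ts + '"', line)

print(greedy)   # <Order OrderName="1472050000"/>
print(lazy)     # <Order OrderName="1472050000" OrderNo="B2" Status="Open"/>
print(negated)  # <Order OrderName="1472050000" OrderNo="B2" Status="Open"/>
```

The greedy result is the "truncated" line described in the question: everything between the first and the last quote is replaced.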

Edit:

The usual problem with regex and XML is that your regex works fine at first but fails on valid XML from other sources (e.g. other exporters, or even other versions of the same exporter), because there are different ways of saying the same thing in XML. For example, <name att="123"></name> and <name att="123"/> are the same as <name att='123' />, which is the same again with the 123 &-quoted. Depending on namespace use, these may also be the same as <a:name att="123"/> or <b:name att="123"/>.
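
A short sketch of that point, assuming a deliberately naive pattern `att="(\d+)"` (my own choice for illustration, not taken from the question): a parser reports the same attribute value for every spelling, while the pattern only catches the double-quoted, unescaped ones. The namespace case is left out to keep the example small.

```python
import re
import xml.etree.ElementTree as ET

# Four spellings of the same element; &#49; is a character reference for "1".
variants = [
    '<name att="123"></name>',
    '<name att="123"/>',
    "<name att='123' />",
    '<name att="&#49;23"/>',
]

# Naive pattern that assumes double quotes and literal digits.
pattern = re.compile(r'att="(\d+)"')

regex_values = []
for text in variants:
    match = pattern.search(text)
    regex_values.append(match.group(1) if match else None)

# The parser resolves quoting style and character references for us.
parsed_values = [ET.fromstring(text).get("att") for text in variants]

print(regex_values)   # ['123', '123', None, None]
print(parsed_values)  # ['123', '123', '123', '123']
```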

Short:

Actually you cannot be sure that your Regex still works when something that you cannot control changes.

But:

Some parsers may produce unexpected results in such cases, too.
Some exporters produce bad XML that normal parsers do not understand correctly, so if those exporters cannot be fixed, workarounds like regexes are needed.

edited Aug 25 '16 at 13:24

answered Aug 24 '16 at 15:29

janbrohl


Thank you, this worked for me. Could you elaborate upon _(apart from the usual do-not-regex-XML-problems)_ – Praveer N Aug 25 '16 at 06:11
If you know about the "usual do-not-regex-XML-problems", why do you advise people to do it anyway? – Tomalak Aug 25 '16 at 06:27
The comments below the question already warn about potential problems and there are actually use-cases to use regex with XML like for parsing invalid XML. – janbrohl Aug 25 '16 at 10:20
elaborated on regexing-XML-problems – janbrohl Aug 25 '16 at 10:21
I don't believe there are use-cases where regex is the superior solution to parse XML. Name one, I'm genuinely interested. Generally speaking: XML is an extremely strict format, any well-formed input *will* be parsed properly. Any input that cannot be parsed simply isn't XML - even if it has angle brackets and stuff like that. Instead of tinkering with the consumer, the producer of broken XML should be fixed. If that's not possible, there are specialized tools like tidy to preprocess the document, or more lenient HTML parsers can deal with it. Falling back to regex simply is never necessary. – Tomalak Aug 25 '16 at 10:52
If you know exactly what to expect, like a file format that just happens to look (mostly) like XML. An example specification could consist of statements like this: "the description text is located between `` and ``" - never stating it actually has to produce XML but often (or even always) producing valid XML. If it was `{desc-start}` and `{desc-end}` you would not normally parse it as XML or preprocess it to be parseable as XML but just use Regex - why take a different approach? Of course you can do that or (if nobody else is using that output) *fix* the generator/spec but why? – janbrohl Aug 25 '16 at 11:52
If you know you get valid XML there is no *need* to use regex (btw I am quite sure that many XML parsers use regexes internally eg for detecting tags) - but for well-defined subsets of XML using regexes directly can be simpler and reliable. – janbrohl Aug 25 '16 at 12:00
@Tomalak In this case the etree-solution is just much better than regexing apart from speed/memory consumption (and the fact that other parts of the document may be modified slightly). Using a real parser is better in *most* cases. – janbrohl Aug 25 '16 at 12:07
Of course XML parsers might use regex internally. That's completely irrelevant and misleading, and claiming that *therefore* you might use regex to parse XML is either dishonest or a result of failing to recognize the complexity of XML and how that complexity rules out regex. Those parsers do not use regex to parse XML (an *irregular* grammar). Those parsers use them to parse very narrow parts of it, like an element name, which can be represented as a *regular* grammar, which is the maximum level of language complexity a regex can recognize (hence the name *regular* expression). – Tomalak Aug 25 '16 at 12:21
Parsing any arbitrarily nested structure with regex is always wrong. Whether that nested structure uses `{}` or `<>` to mark up its constituent parts is beside the point. The fact that it is *nested* is the deal-breaker. If your hypothetical `{desc-start}` / `{desc-end}` structure does not have nesting, regex would be an option. But the complexities in XML are more far-reaching. There is character encoding (i.e. byte layout in the file). There is character escaping, and multiple variants of it, too. There are CDATA sections. Comments. All of this falls apart if you treat XML as plain-text. – Tomalak Aug 25 '16 at 12:29
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/121838/discussion-between-janbrohl-and-tomalak). – janbrohl Aug 25 '16 at 12:34
