Find and Replace in Python- based on unknown characters

Question

I've been stumped on finding a way to find and replace characters based on position. Basically, what I am looking to do is go into a document and replace

<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>

With

<gco:DateTime>2016-04-20T11:27:34</gco:DateTime>

Everything from the decimal point up to the closing tag must be deleted. The issue is that this is for multiple time stamps in XML files, and each of these time stamps has a different value. I've read a bit on regex and it seems like a possible method. Any help would be greatly appreciated.

Edit Example of XML file format:

<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type='text/xsl' href='http://ngis/ngis/metadata/StyleSheet/xslt/nGIS_Metadata.xslt'?>
<gmd:MD_Metadata xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:gmx="http://www.isotc211.org/2005/gmx" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:gfc="http://www.isotc211.org/2005/gfc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:gss="http://www.isotc211.org/2005/gss" xmlns:gsr="http://www.isotc211.org/2005/gsr" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gmi="http://www.isotc211.org/2005/gmi" xmlns:gmd="http://www.isotc211.org/2005/gmd">
    <gmd:fileIdentifier>
        <gco:CharacterString>BF244A7CB62491BC74B001BE5DEAA213AAFB9DBA</gco:CharacterString>
    </gmd:fileIdentifier>
    <gmd:language>
        <gco:CharacterString>English</gco:CharacterString>
                <gmd:date>
                <gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>
                </gmd:date>

@Parfait

Regexes will solve this and other similar problems and you should keep reading about them. In this specific case parsing and formatting dates is also a good approach. — Alex Hall, Jun 07 '16 at 22:39
I would further caution you against trying to process XML much without actually parsing it into a proper tree using a library such as `lxml` or `ElementTree`, though you might get away with it if all your transformations are as uncomplicated as this one. — holdenweb, Jun 07 '16 at 22:43
It cannot be emphasized enough (perhaps the highest voted answer on SO), [do not regex html/xml files](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). — Parfait, Jun 08 '16 at 00:25

score 0 · Answer 1 · answered Jun 07 '16 at 22:46

One way:

s = "<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>"
split_on_dot = s.split('.')
split_on_angle = split_on_dot[1].split('<')
new_s = "".join([split_on_dot[0], "<", split_on_angle[1]])

>>> new_s
'<gco:DateTime>2016-04-20T11:27:34</gco:DateTime>'
>>>

This depends on the period being the only period in the input string. I'm not so good at regexes. I think they get overused, but I'm sure someone will show you how to do this with a regex. Just remember that Python has good string manipulation built in.
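
For comparison, here is a minimal regex sketch of the same replacement. It assumes every timestamp sits inside a `<gco:DateTime>` element and contains exactly one period, as described in the question. The names `DATETIME_RE` and `strip_fraction` are illustrative and not from the original post.

```python
import re

# Group 1: opening tag plus everything before the first period.
# Group 2: the closing tag. The period and the text between it and the
# closing tag (fractional seconds and UTC offset) are dropped.
DATETIME_RE = re.compile(r"(<gco:DateTime>[^<.]*)\.[^<]*(</gco:DateTime>)")

def strip_fraction(text):
    """Remove fractional seconds and offset from every gco:DateTime in text."""
    return DATETIME_RE.sub(r"\1\2", text)

s = "<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>"
new_s = strip_fraction(s)
print(new_s)  # <gco:DateTime>2016-04-20T11:27:34</gco:DateTime>
```

Because the pattern is applied with `sub`, it handles several timestamps in one string. It is still plain text matching, so it only suits input whose format is as consistent as the sample.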

joel goldstick

Thanks joel, I would need this to be able to parse through multiple unknown dates for each file. There are about 6 date stamps with this format in each file. And the format is consistent through each, with only one period being used. – MapZombie Jun 07 '16 at 22:54
Then, good, but heed @holdenweb comments about xml parsing. My answer just takes care of things once you have the element you want to change. Stephen Holden introduced me to python in a course he taught – joel goldstick Jun 07 '16 at 23:04
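
One way to combine the two suggestions (parse the XML first, then apply the string split to each element) is sketched below with the standard library's `xml.etree.ElementTree`. The sample document is a cut-down, made-up fragment using the `gco` namespace URI from the question, and `truncate_datetimes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the xmlns:gco declaration in the question's XML.
GCO_NS = "http://www.isotc211.org/2005/gco"

def truncate_datetimes(xml_text):
    """Parse xml_text and cut every gco:DateTime value at its first period."""
    root = ET.fromstring(xml_text)
    for elem in root.iter("{%s}DateTime" % GCO_NS):
        if elem.text and "." in elem.text:
            elem.text = elem.text.split(".", 1)[0]
    return root

# Made-up fragment with two timestamps, for illustration only.
sample = (
    '<gmd:date xmlns:gmd="http://www.isotc211.org/2005/gmd" '
    'xmlns:gco="http://www.isotc211.org/2005/gco">'
    '<gco:DateTime>2016-04-20T11:27:34.8677919-06:00</gco:DateTime>'
    '<gco:DateTime>2016-05-01T08:00:00.1234567-06:00</gco:DateTime>'
    '</gmd:date>'
)
root = truncate_datetimes(sample)
stamps = [e.text for e in root.iter("{%s}DateTime" % GCO_NS)]
print(stamps)  # ['2016-04-20T11:27:34', '2016-05-01T08:00:00']
```

When writing the tree back out, note that the standard library serializer renames prefixes to `ns0`, `ns1`, and so on unless each one is registered with `ET.register_namespace`, and its default parser does not keep the `xml-stylesheet` processing instruction. If those details matter, `lxml` is probably the better fit.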

Parfait · Accepted Answer · 2016-06-08T18:16:34.147

Consider XSLT, the special-purpose declarative language designed to transform XML documents. It has a very convenient function for your needs (shared with its sibling, XPath): substring-before(), which extracts the text before the period in each timestamp. Python's lxml module can run XSLT 1.0 scripts.

The script below parses the XML and the XSLT from files. Specifically, the XSLT runs the Identity Transform to copy the document as is, then for every <gco:DateTime> keeps only the text before the period. Note that only the needed gco namespace is defined in the XSLT header:

XSLT Script (save externally as an .xsl file to be referenced in Python)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:gco="http://www.isotc211.org/2005/gco">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

  <!-- Identity Transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="gco:DateTime">
    <xsl:copy>
      <xsl:copy-of select="substring-before(., '.')"/>                  
    </xsl:copy>
  </xsl:template>

</xsl:transform>

Python Script

import lxml.etree as ET

# LOAD XML AND XSL
dom = ET.parse('Input.xml')
xslt = ET.parse('XSLTScript.xsl')

# TRANSFORM XML 
transform = ET.XSLT(xslt)
newdom = transform(dom)

# CONVERT TO STRING
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)

# OUTPUT TREE TO FILE (tostring() with an encoding returns bytes, so open in binary write mode)
with open('Output.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)

Output

<?xml version="1.0"?>
<?xml-stylesheet type='text/xsl' href='http://ngis/ngis/metadata/StyleSheet/xslt/nGIS_Metadata.xslt'?><gmd:MD_Metadata xmlns:gml="http://www.opengis.net/gml/3.2" xmlns:gmx="http://www.isotc211.org/2005/gmx" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:gfc="http://www.isotc211.org/2005/gfc" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:gss="http://www.isotc211.org/2005/gss" xmlns:gsr="http://www.isotc211.org/2005/gsr" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gmi="http://www.isotc211.org/2005/gmi" xmlns:gmd="http://www.isotc211.org/2005/gmd">
  <gmd:fileIdentifier>
    <gco:CharacterString>BF244A7CB62491BC74B001BE5DEAA213AAFB9DBA</gco:CharacterString>
  </gmd:fileIdentifier>
  <gmd:language>
    <gco:CharacterString>English</gco:CharacterString>
    <gmd:date>
      <gco:DateTime>2016-04-20T11:27:34</gco:DateTime>
    </gmd:date>
  </gmd:language>
</gmd:MD_Metadata>
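
Since the goal is to fix timestamps in many XML files, the per-file steps can be wrapped in a loop over a folder tree. The sketch below is an assumption about how that might look, not part of the original answer. `batch_apply` and `process_file` are hypothetical names, and the output folder is assumed to be separate from the input folder.

```python
from pathlib import Path

def batch_apply(src_root, dst_root, process_file):
    """Call process_file(src, dst) for every *.xml file under src_root.

    The folder layout under src_root is mirrored under dst_root.
    Returns the list of destination paths, in sorted source order.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = []
    # sorted() collects all matches before any output file is created.
    for src in sorted(src_root.rglob("*.xml")):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        process_file(src, dst)
        written.append(dst)
    return written

# Hypothetical callback reusing `ET` and `transform` from the script above:
# def process_file(src, dst):
#     newdom = transform(ET.parse(str(src)))
#     dst.write_bytes(ET.tostring(newdom, encoding='UTF-8',
#                                 pretty_print=True, xml_declaration=True))
```

Passing the per-file work in as a callback keeps the directory walk independent of lxml, so the same loop works with any of the approaches in this thread.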

Thanks Parfait, this is working great. Really appreciate it! — MapZombie, Jun 08 '16 at 16:14
Please post snippet of actual xml (all of its headers as you have a namespace, `gco`, that should be defined). And you should not need to start from third line. — Parfait, Jun 08 '16 at 17:11
Just posted a sample xml in my original question. I'm hoping to iterate this script through all folders with XML files. And most of the XMLs have the same general format. — MapZombie, Jun 08 '16 at 17:37
`lxml.etree.XMLSyntaxError: XML declaration allowed only at the start of the document, line 2, column 16` is the error I get when pasting the entire XML. — MapZombie, Jun 08 '16 at 18:04
As mentioned, you can parse both XML and XSL from file. See edit walking you through the steps. Simply iterate other files from directory. — Parfait, Jun 08 '16 at 18:17
Excellent, thanks. I ran into an error, `IOError: File not open for writing`, but I modified your xmlfile output with r+ and it is now working: `xmlfile = open('Z:\FILENAME.xml',"r+")` — MapZombie, Jun 08 '16 at 19:08
Great. Interesting error, must be a python 2.7 vs 3.4 difference. Please accept answer to confirm resolution. — Parfait, Jun 08 '16 at 19:41
