I am trying to prefix the values of the TEXT_VALUE field with an incrementing number in all my XML files, but only inside the tags called "TRANSL" with ID="Example". Currently I am doing it manually, but since I have several thousand of them, I think I should do it programmatically.

Here is the initial version:

<TRANSL ID="Example">
    <TRANSCIPT>
        <REF_TEXT TEXT_ID="a680" TXT_TM="a24">
            <TEXT_VALUE>this is an example</TEXT_VALUE>
        </REF_TEXT>
    </TRANSCIPT>
    <TRANSCIPT>
        <REF_TEXT TEXT_ID="a681" TXT_TM="a25">
            <TEXT_VALUE>another example</TEXT_VALUE>
        </REF_TEXT>
    </TRANSCIPT>
    <TRANSCIPT>
        <REF_TEXT TEXT_ID="a682" TXT_TM="a26">
            <TEXT_VALUE>third example</TEXT_VALUE>
        </REF_TEXT>
    </TRANSCIPT>
</TRANSL>

And here is how the edited version should look:

<TRANSL ID="Example">
    <TRANSCIPT>
        <REF_TEXT TEXT_ID="a680" TXT_TM="a24">
            <TEXT_VALUE>1-this is an example</TEXT_VALUE>
        </REF_TEXT>
    </TRANSCIPT>
    <TRANSCIPT>
        <REF_TEXT TEXT_ID="a681" TXT_TM="a25">
            <TEXT_VALUE>2-another example</TEXT_VALUE>
        </REF_TEXT>
    </TRANSCIPT>
    <TRANSCIPT>
        <REF_TEXT TEXT_ID="a682" TXT_TM="a26">
            <TEXT_VALUE>3-third example</TEXT_VALUE>
        </REF_TEXT>
    </TRANSCIPT>
</TRANSL>

How can I do it programmatically? Are there any professional XML editors out there for this? If not, how can I do it in Python, PowerShell, Perl, Notepad++, or any other tool?

Here is my Python script, run as a Notepad++ plugin:

def increment_replace(match):
    return "<TEXT_VALUE>{}".format(str(int(match.group(1))+1))

editor.rereplace(r'\<TEXT_VALUE\>', increment_replace)

but it is not working...

cplus

2 Answers

To get the current count/position() of the <TEXT_VALUE> elements, you can refer to the count/position() of the enclosing <TRANSCIPT> element.

To pass this count to the subsequent templates, I used the solution from this SO answer and incorporated its approach into the identity template, which now passes along a num parameter. The num parameter is generated in a <for-each> loop over all <TRANSCIPT> elements and passed down the <apply-templates> hierarchy to be used in the TEXT_VALUE template (everywhere else it is simply ignored).

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- modified identity template -->
  <xsl:template match="node()|@*">
    <xsl:param name="num" />
    <xsl:copy>  
      <xsl:apply-templates select="node()|@*">
        <xsl:with-param name="num" select="$num"/>
      </xsl:apply-templates>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="TRANSL">
    <xsl:copy>
      <xsl:apply-templates select="@*" />
      <xsl:for-each select="TRANSCIPT">
        <xsl:copy>
          <xsl:apply-templates>
            <xsl:with-param name="num" select="position()" />
          </xsl:apply-templates>
        </xsl:copy>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="TEXT_VALUE[../../../@ID='Example']">   <!-- added after extension of question -->
    <xsl:param name="num" />
    <xsl:element name="TEXT_VALUE">
      <xsl:value-of select="concat($num,'-',text())" />
    </xsl:element>        
  </xsl:template>

</xsl:stylesheet>

EDIT:
After the requirements were extended in a comment, I added a predicate to the TEXT_VALUE template so that it only matches TEXT_VALUE elements whose enclosing TRANSL element has an @ID attribute with the value "Example".
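
For readers who want to see the numbering logic outside XSLT, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is not part of the stylesheet above: the function name number_text_values, the ROOT wrapper element, and the sample document are hypothetical, and it assumes every TRANSCIPT is a direct child of its TRANSL, as in the question.

```python
import xml.etree.ElementTree as ET


def number_text_values(root, wanted_id="Example"):
    """Prefix TEXT_VALUE texts with the 1-based position of their TRANSCIPT."""
    # iter() also yields root itself if root is a TRANSL element
    for transl in root.iter("TRANSL"):
        if transl.get("ID") != wanted_id:
            continue
        # Direct TRANSCIPT children in document order (mirrors position())
        for num, transcipt in enumerate(transl.findall("TRANSCIPT"), start=1):
            for text_value in transcipt.iter("TEXT_VALUE"):
                text_value.text = "{}-{}".format(num, text_value.text or "")
    return root


# Hypothetical sample: one matching TRANSL and one that must stay untouched
sample = """<ROOT>
  <TRANSL ID="Example">
    <TRANSCIPT><REF_TEXT><TEXT_VALUE>this is an example</TEXT_VALUE></REF_TEXT></TRANSCIPT>
    <TRANSCIPT><REF_TEXT><TEXT_VALUE>another example</TEXT_VALUE></REF_TEXT></TRANSCIPT>
  </TRANSL>
  <TRANSL ID="Other">
    <TRANSCIPT><REF_TEXT><TEXT_VALUE>untouched</TEXT_VALUE></REF_TEXT></TRANSCIPT>
  </TRANSL>
</ROOT>"""

root = number_text_values(ET.fromstring(sample))
numbered = [tv.text for tv in root.iter("TEXT_VALUE")]
print(numbered)
```

The counter restarts for each matching TRANSL, which corresponds to what position() does inside the for-each loop; TRANSL elements with a different ID are skipped entirely.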

zx485
  • how should I run this? i am new to this environment... how can I run this to do this replacing job in all my files? – cplus Mar 18 '17 at 14:21
  • @cplus: You can use any XSLT-1.0 processor for this job. Which one depends on your OS. On any Linux variant you can use e.g. `xsltproc` which is often part of the default installation. But there are many others for Windows and Mac, too. An OS independent one is [Saxon](http://saxon.sourceforge.net/) which is written in Java and also XSLT-2.0 capable. – zx485 Mar 18 '17 at 14:27
  • @cplus: If you pass the filenames via a _bash script_/_batch file_ this task should be done in a minute. – zx485 Mar 18 '17 at 14:29
  • by the way, shouldn't the `` be this one instead `` ? or i am missing something? – cplus Mar 18 '17 at 14:30
  • @cplus: That depends on which node you want to replace. In the example given you want to replace the `TEXT_VALUE/text()` node by another value - so it is designed to replace that node. (I admit that this template is beyond beginner's scope). – zx485 Mar 18 '17 at 14:32
  • Actually, I want to do the replacement only in TRANSL nodes that have ID="Example", not in all TRANSL tags. If I am not mistaken, this applies it to all of the TRANSL elements, right? – cplus Mar 18 '17 at 14:34
  • @cplus - this answer is an [XSLT](https://www.w3.org/TR/xslt) script, a special-purpose language designed to transform XML files. In Python or any other general-purpose language (Java, PHP, VB) you can use a library that can run XSLT on XML sources. In Python, you can use the third-party module, `lxml`, to run XSLT 1.0 scripts iterating through all thousands of files. – Parfait Mar 18 '17 at 14:40
  • @cplus: I extended my answer. – zx485 Mar 18 '17 at 14:42
  • thanks for this comprehensive and accurate answer. Is it possible to give some indications on how to do it under bash for a folder that contains all of the desired files? thanks in advance – cplus Mar 18 '17 at 14:51
  • Yes. Use `for i in *.xml; do xsltproc xsltfilename.xslt "$i"; done` with _xsltfilename_ denoting the XSLT code of my answer. – zx485 Mar 18 '17 at 14:55
  • @zx485 should this be a .sh file or a .bat file? It says "i was unexpected at this time." I am using a .bat file. Please help, thanks. – cplus Mar 18 '17 at 15:32
  • @cplus:In your last comment you asked for a bash command, so I gave you one. – zx485 Mar 18 '17 at 15:34
  • @zx485 thanks, it is working, but it is not saving (overwriting) the files; it is just displaying the result on the terminal. Do you have any idea how to overwrite them? – cplus Mar 18 '17 at 15:53
  • [xsltproc's](http://xmlsoft.org/XSLT/xsltproc2.html) `-o` argument saves the transformation to a file. – Parfait Mar 18 '17 at 15:55
  • @cplus: @Parfait is right. So use something like `for i in *.xml; do xsltproc xsltfilename.xslt -o "$i.res" "$i"; done` to convert all `.xml` files to `.xml.res` files. – zx485 Mar 18 '17 at 15:59

To add to @zx485's answer, here is a variant XSLT script that uses count() over the preceding-sibling axis for the numbering, run from Python with lxml. For background, XSLT is a special-purpose language for transforming XML files, which makes it a handy tool for converting your initial XML files into their final format.

With Python being a general-purpose language, you can leverage its os filesystem module and the third-party module lxml (built on libxml2 and libxslt, with XPath 1.0 and XSLT 1.0 support) to iterate over the files and create the needed outputs.

XSLT (save as .xsl file to be parsed in Python)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="xml"/>
<xsl:strip-space elements="*"/>

  <!-- Identity Transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Add Increment Number to Text -->
  <xsl:template match="TEXT_VALUE[ancestor::TRANSL/@ID='Example']">
    <xsl:copy>      
      <xsl:value-of select="concat(count(ancestor::TRANSCIPT/preceding-sibling::TRANSCIPT)+1, '-', text())"/>
    </xsl:copy>
  </xsl:template>

</xsl:transform>

Python

import os
import lxml.etree as et

# CHANGE DIRECTORY
os.chdir('/path/to/raw/XML/files')

# LOAD XSLT SCRIPT AND INITIALIZE TRANSFORMER
xslt = et.parse('/path/to/XSLT_Script.xsl')
transform = et.XSLT(xslt)

for name in os.listdir():
    # SKIP NON-XML FILES AND OUTPUTS FROM A PREVIOUS RUN
    if not name.endswith('.xml') or name.endswith('_new.xml'):
        continue

    # LOAD SOURCE XML
    dom = et.parse(name)

    # TRANSFORM TO NEW TREE
    newdom = transform(dom)

    # SAVE TO FILE (SAME NAME WITH _new SUFFIX)
    with open(name[:-len('.xml')] + '_new.xml', 'wb') as f:
        f.write(newdom)
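
Since the comments on the other answer ask about overwriting the original files rather than writing new ones, here is a rough sketch of one way to do that safely. It uses only the standard library (xml.etree.ElementTree instead of lxml), and the names rewrite_in_place and add_prefix are hypothetical helpers, not part of either answer. Note that the standard-library parser discards comments by default, so keep backups before running anything like this over thousands of files.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def rewrite_in_place(path, edit):
    """Parse path, apply edit(root), then replace the original file."""
    tree = ET.parse(path)
    edit(tree.getroot())
    folder = os.path.dirname(os.path.abspath(path))
    # Temp file in the same folder so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=folder)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tree.write(tmp, encoding="UTF-8", xml_declaration=True)
        os.replace(tmp_path, path)
    except Exception:
        # Leave the original untouched and clean up the partial temp file
        os.remove(tmp_path)
        raise


def add_prefix(root):
    # Hypothetical edit: number every TEXT_VALUE in document order
    for num, tv in enumerate(root.iter("TEXT_VALUE"), start=1):
        tv.text = "{}-{}".format(num, tv.text or "")


# Demo on a throwaway file in a temporary folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.xml")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<TRANSL ID='Example'><TEXT_VALUE>a</TEXT_VALUE>"
                "<TEXT_VALUE>b</TEXT_VALUE></TRANSL>")
    rewrite_in_place(path, add_prefix)
    result = [tv.text for tv in ET.parse(path).iter("TEXT_VALUE")]
    leftovers = [n for n in os.listdir(folder) if n.endswith(".tmp")]

print(result)
```

Writing to a temporary file first means a crash halfway through a file leaves the original intact; the original is only swapped out once the new content has been fully written.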
Parfait