
I'm using the following Python code to parse multiple .xml files:

import os
import lxml.etree as ET
import sys

inputpath = 
xsltfile = 
outpath = 

dir = []

if sys.version_info[0] >= 3:
    unicode = str

for dirpath, dirnames, filenames in os.walk(inputpath):
    structure = os.path.join(outpath, dirpath[len(inputpath):])
    if not os.path.isdir(structure):
        os.mkdir(structure)
    for filename in filenames:
        if filename.endswith(('.xml')):
            dir = os.path.join(dirpath, filename)
            print(dir)
            dom = ET.parse(dir)
            xslt = ET.parse(xsltfile)
            transform = ET.XSLT(xslt)
            newdom = transform(dom)
            infile = unicode((ET.tostring(newdom, pretty_print=True,xml_declaration=True,standalone='yes')))
            outfile = open(structure + "\\" + filename, 'a')
            outfile.write(infile)

I have an .xslt template that sorts the uuids within each file:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" standalone="yes"/>
<xsl:strip-space elements="*"/>

<!-- identity transform -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="uuids">
    <xsl:copy>
        <xsl:apply-templates select="uuid">
            <xsl:sort select="."/>
        </xsl:apply-templates>
    </xsl:copy>
</xsl:template>
</xsl:stylesheet>

The desired output should keep the same Unicode characters as the source, with only the uuids sorted within each file. The uuids are sorted correctly, but the Unicode characters are being changed to numbers, which I don't want.

sandy
  • Do you have XML prolog declaration with **encoding** in the input XML? – Yitzhak Khabinsky May 24 '21 at 15:56
  • XSLT - While asking a question you need to provide a **minimal reproducible example**: (1) Input XML. (2) Your logic, and XSLT that tried to implement it. (3) Desired output. (4) XSLT processor and its version. – Yitzhak Khabinsky May 24 '21 at 16:22
  • I think you should add a `python` tag to your question, since the problem is not with your XSLT code, but with the way the output of the XSL transformation is serialized by your calling application. – michael.hor257k May 24 '21 at 16:48

1 Answer


While asking a question it is a good idea to provide a minimal reproducible example, i.e. XML/XSLT pair.

Please try the following conceptual example.

I am using SAXON 9.7.0.15

It is very possible that the last Python line is causing the issue:

outfile.write(ET.tostring(newdom,pretty_print=True,xml_declaration=True,standalone='yes').decode())
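To see why that line matters, here is a minimal sketch (not the asker's actual files) of how `tostring` behaves with and without an explicit encoding. When no encoding is given, the output is ASCII, so every non-ASCII character is written as a numeric character reference. The sample element is made up for illustration, and the fallback import is an assumption for machines without lxml; the standard library's `xml.etree.ElementTree.tostring` behaves the same way for this purpose.

```python
try:
    import lxml.etree as ET  # the library used in the question
except ImportError:
    # Assumption: fall back to the stdlib API, which serializes the same way here.
    import xml.etree.ElementTree as ET

# Hypothetical sample element with non-ASCII text.
root = ET.fromstring("<value>あいうえお@domain.com</value>")

# No encoding given: ASCII bytes, non-ASCII characters become &#...; references.
ascii_bytes = ET.tostring(root)

# Explicit UTF-8: bytes that contain the characters themselves.
utf8_bytes = ET.tostring(root, encoding="utf-8")

# encoding="unicode" returns a Python str instead of bytes.
text = ET.tostring(root, encoding="unicode")

print(ascii_bytes)
print(utf8_bytes.decode("utf-8"))
print(text)
```

Note also that in Python 3, calling `str()` on a bytes object without an encoding returns its `repr` (for example `"b'abc'"`), so wrapping the bytes from `tostring` in `str(...)` does not decode them.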

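Please try changing the last lines of the Python code as follows:

import io
import sys
if sys.version_info[0] >= 3:
    unicode = str
...
newdom = transform(dom)
# Serialize as UTF-8 bytes, then decode to text so non-ASCII characters are kept as-is
infile = unicode(ET.tostring(newdom, pretty_print=True, xml_declaration=True, encoding='utf-8'), 'utf-8')
# The encoding is set when opening the file; write() accepts only the text
outfile = io.open(structure + "\\" + filename, 'a', encoding='utf-8')
outfile.write(infile)
outfile.close()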

https://lxml.de/api/lxml.etree._ElementTree-class.html#write

Reference link: How to transform an XML file using XSLT in Python
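Putting the pieces together, the whole loop could look roughly like the sketch below. It is one possible arrangement under stated assumptions, not the asker's final code: the function name `transform_tree` is made up for illustration, the stylesheet is compiled once instead of once per file, the output directory is derived with `os.path.relpath` rather than by slicing `dirpath`, and the file is written in binary mode so the UTF-8 bytes from `tostring` go to disk unchanged.

```python
import os
import lxml.etree as ET


def transform_tree(inputpath, xsltfile, outpath):
    """Apply one stylesheet to every .xml file under inputpath.

    The directory layout is mirrored under outpath. Returns the list of
    files written. (Hypothetical helper, named here for illustration.)
    """
    # Compile the stylesheet once and reuse it for every input file.
    transform = ET.XSLT(ET.parse(xsltfile))
    written = []
    for dirpath, dirnames, filenames in os.walk(inputpath):
        # Mirror the current directory under outpath.
        relative = os.path.relpath(dirpath, inputpath)
        target_dir = os.path.normpath(os.path.join(outpath, relative))
        os.makedirs(target_dir, exist_ok=True)
        for filename in filenames:
            if not filename.endswith(".xml"):
                continue
            newdom = transform(ET.parse(os.path.join(dirpath, filename)))
            # encoding="utf-8" keeps non-ASCII text as real characters
            # instead of numeric character references.
            data = ET.tostring(
                newdom,
                pretty_print=True,
                xml_declaration=True,
                encoding="utf-8",
                standalone=True,
            )
            target = os.path.join(target_dir, filename)
            # Binary mode: the bytes are already encoded as UTF-8.
            with open(target, "wb") as outfile:
                outfile.write(data)
            written.append(target)
    return written
```

It would be called as `transform_tree(inputpath, xsltfile, outpath)` with the same three paths the question's script defines. Unlike the original, this sketch overwrites existing output files rather than appending to them.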

Input XML

<?xml version="1.0" encoding="UTF-8"?>
<a:ruleInputTestConfigs xmlns:a="URI">
    <a:value xmlns:xsd="http://www.w3.org/2001/XMLSchema"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:type="xsd:string">あいうえお@domain.com</a:value>
    <a:nameRef>email</a:nameRef>
    <a:id>1</a:id>
</a:ruleInputTestConfigs>

XSLT

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes" standalone="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- identity transform -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Output XML

<?xml version="1.0" encoding="UTF-8"?>
<a:ruleInputTestConfigs xmlns:a="URI">
    <a:value xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:type="xsd:string"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">あいうえお@domain.com</a:value>
    <a:nameRef>email</a:nameRef>
    <a:id>1</a:id>
</a:ruleInputTestConfigs>
Yitzhak Khabinsky
  • I tried changing it to UTF-16, but I still see that it's changing the Unicode characters to digits. – sandy May 24 '21 at 16:30
  • @sandy, did you see my comment about a **minimal reproducible example**? – Yitzhak Khabinsky May 24 '21 at 16:32
  • @sandy, I updated my answer too. Check it out. You still didn't provide a minimal reproducible example: ##1-4. – Yitzhak Khabinsky May 24 '21 at 16:41
  • It's a large XML file and I couldn't paste it here. As for 4, it's already in the .xslt template: `xsl:stylesheet version="1.0"` – sandy May 24 '21 at 16:44
  • @sandy. We don't need the entire XML file. But we do need its prolog and a root element. There are reasons why it is called a MINIMAL reproducible example. – Yitzhak Khabinsky May 24 '21 at 16:53
  • @sandy, I updated the answer with the Python code suggestions. Please give it a shot. – Yitzhak Khabinsky May 24 '21 at 17:14
  • I just updated the .xml file in the question and I also tried the Python change you recommended. It's throwing this error: `NameError: name 'unicode' is not defined` – sandy May 24 '21 at 17:22
  • @sandy, it is all explained here: https://stackoverflow.com/questions/19877306/nameerror-global-name-unicode-is-not-defined-in-python-3 I updated the Python snippet. – Yitzhak Khabinsky May 24 '21 at 17:28
  • @sandy, Are you getting the `あいうえお@domain.com` correctly? – Yitzhak Khabinsky May 24 '21 at 17:43
  • I tried adding unicode using str, but I still see the output with the numbers. – sandy May 24 '21 at 17:44
  • This is the output I'm getting: `<!あいうえお@domain.com>` – sandy May 24 '21 at 17:45
  • @sandy. I am running out of ideas. As I see it, the XSLT is working correctly. The issue is clearly on the Python side. – Yitzhak Khabinsky May 24 '21 at 17:46
  • Check it out here: https://lxml.de/3.6/FAQ.html "... What is the difference between str(xslt(doc)) and xslt(doc).write() ? "...If you call str(), it will return the serialized result as specified by the XSL transform. This correctly serializes string results to encoded Python strings and honours xsl:output options like indent. This almost certainly does what you want, so you should only use write() if you are sure that the XSLT result is an XML tree and you want to override the encoding and indentation options requested by the stylesheet. ..." – Yitzhak Khabinsky May 24 '21 at 17:57
  • I added `encoding="utf-8"` in two places: first in `tostring(..., encoding="utf-8")`, second in `open(..., encoding="utf-8")`. Now it's giving the output as required. – sandy May 24 '21 at 18:06
  • @sandy, good to hear good news. Please close the case. – Yitzhak Khabinsky May 24 '21 at 18:11
  • Thanks, can you also look into this issue if you have some time? https://stackoverflow.com/questions/67677415/xslt-template-to-sort-uuids-in-xml-with-cdata-elements – sandy May 24 '21 at 18:48
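
As an illustration of the lxml FAQ passage quoted in the comments above, the following sketch contrasts `str()` on an XSLT result with a plain `tostring()` call. The input document and the identity stylesheet are made up for the example. In Python 3, `str(result)` returns text serialized according to the stylesheet's `xsl:output` settings, so the non-ASCII characters survive, while `tostring(result)` without an encoding escapes them.

```python
import lxml.etree as ET

# Hypothetical input document with non-ASCII text.
doc = ET.fromstring("<root><value>あいうえお@domain.com</value></root>")

# Identity stylesheet that requests UTF-8 output.
stylesheet = ET.fromstring(
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:output method="xml" encoding="UTF-8" indent="yes"/>'
    '<xsl:template match="@*|node()">'
    '<xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>'
    '</xsl:template>'
    '</xsl:stylesheet>'
)

transform = ET.XSLT(stylesheet)
result = transform(doc)

# Serialized as the stylesheet's xsl:output asks; returns a str in Python 3.
result_text = str(result)

# Generic serialization with no encoding: ASCII with character references.
escaped_bytes = ET.tostring(result)

print(result_text)
print(escaped_bytes)
```

If `result_text` is then written to disk, the file should be opened with `encoding="utf-8"` so the characters are stored as UTF-8 regardless of the platform default.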