0

New to Stackoverflow and asking a question about XML to CSV. I'm a data manager with a background in SPSS, so the XML isn't always my strongsuit. I'm trying to convert a dataset exported from a hierarchical database and stored in XML into a CSV format for a number of reasons. The original DB's structure isn't great, which is causing problems with my XSLT.

  1. Here is the XML I have to work with. It's a 700mb file:

      <ABC_Data>
        <UID>1</UID>
        <DocumentNumber>000000001</DocumentNumber>
        <Surname>Smith</Surname>
        <GivenName>John</GivenName>
        <BirthDateList>
            <BirthDate>19/06/19888</BirthDate>
        </BirthDateList>
        <StationNumberList>
            <StationNumber>2009981</StationNumber>
        </StationNumberList>
        <Reference>
            <ReferenceEn>RG 150, Volume 01 - 1</ReferenceEn>
            <ReferenceFr>RG 150, Volume 01 - 1</ReferenceFr>
        </Reference>
        <DigitizeList>
            <Image>http://data.foo.bar.com/733a.gif</Image>
            <Image>http://data2.for.bar.com/733b.gif</Image>
        </DigitizeList>
        <UID>2</UID>
        <DocumentNumber>000000002</DocumentNumber>
        <Surname>Kootz</Surname>
        <GivenName>Ernst</GivenName>
        <BirthDateList>
            <BirthDate>24/12/1984</BirthDate>
        </BirthDateList>
        <StationNumberList>
            <StationNumber>2000023</StationNumber>
        </StationNumberList>
        <Reference>
            <ReferenceEn>RG 150, Volume 01 - 1</ReferenceEn>
            <ReferenceFr>RG 150, Volume 01 - 1</ReferenceFr>
        </Reference>
        <DigitizeList>
            <Image>http://data.foo.bar.com/744a.gif</Image>
            <Image>http://data2.for.bar.com/755b.gif</Image>
    
        </DigitizeList>
        </ABC_Data>
    
  2. Here is the basic XSLT I'm working from (from this thread) to convert this into a CSV format. What's happening is that the records are not nested properly, so I can't get an output that differentiates one record to another in the file. Also, the multiple <Image> fields are being collected together on the output without intervening separators, i.e., they're turning into 1 field into 2 or 3 or 4 distinct fields, as the number of <Image>s might be in the text [Edit: Now solved].

Here is the XSLT:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" encoding="iso-8859-1"/>

    <xsl:strip-space elements="*" />

    <xsl:template match="/*/child::*">
    <xsl:for-each select="child::*">
    <xsl:if test="position() != last()"><xsl:value-of select="normalize-space(.)"/>;</xsl:if>
    <xsl:if test="position() = last()"><xsl:value-of select="normalize-space(.)"/>;</xsl:if>
    </xsl:for-each>
    </xsl:template>

    </xsl:stylesheet>
  1. Here is the output model I'd like to get to. It answers the need for differentiated records as well as distinguishing the multiple "Image" fields with similar names:

1;0000000001;Smith;John;19/06/19888;2009981;RG 150, Volume 01 - 1;RG 150, Volume 01 - 1;>http://data.foo.bar.com/733a.gif;http://data2.for.bar.com/733b.gif
2;0000000002;Koontz;Ernst;24/12/1984;2000023;RG 150, Volume 01 - 1;RG 150, Volume 01 - 1;http://data.foo.bar.com/744a.gif;http://data2.for.bar.com/755b.gif

Can anyone suggest a way forward? I'd like to clean this up so that

  • all content in separate Image fields are given a semicolon between them on the output. [Edit: Solved, thank you hivemind!]
  • I can differentiate between Record 1 and Record 2, Record 3, etc.

My XSLT knowledge is nearly 10 years old, so I could use the support of the community for help on this.

Thanks.

Community
  • 1
  • 1

2 Answers2

1

try this

<xsl:template match="/">
    <xsl:for-each select="descendant::*[not(child::*)]">
        <xsl:value-of select="normalize-space(.)"/><xsl:text>;</xsl:text>
    </xsl:for-each>
</xsl:template>
Rupesh_Kr
  • 3,395
  • 2
  • 17
  • 32
0

AFAICT, the following stylesheet will produce a result that is almost identical to the expected output:

XSLT 1.0

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*" />

<xsl:key name="cells" match="ABC_Data/*[not(self::UID)]" use="generate-id(preceding-sibling::UID[1])" />

<xsl:template match="/ABC_Data">
    <xsl:for-each select="UID">
        <xsl:apply-templates select=". | key('cells', generate-id())"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
</xsl:template>

<xsl:template match="*[not(*)]">
    <xsl:value-of select="." />
    <xsl:text>;</xsl:text>
</xsl:template>

</xsl:stylesheet>

The only difference is that each row retains a trailing ; character. This is because we don't know which element is the last cell in its row - and whether it is one that contains multiple children or not.

If you do know this, you can add a template matching it by name. Otherwise you would have to place each row into a variable first, then output the variable without its last character:

<xsl:template match="/ABC_Data">
    <xsl:for-each select="UID">
        <xsl:variable name="row">
            <xsl:apply-templates select=". | key('cells', generate-id())"/>
        </xsl:variable>
        <xsl:value-of select="substring($row, 1, string-length($row) - 1)" />
        <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
</xsl:template>

As an aside, I have doubts about the usefulness of this result. As a recipient of a CSV file, I would expect each column to have data from the same domain (in fact, I would expected each column to have a label). At least in theory, your input could contain records with varying number of BirthDates, StationNumbers, References, etc, - resulting in rows containing a varying number of cells in misaligned columns.

michael.hor257k
  • 113,275
  • 6
  • 33
  • 51
  • Thank you on all counts! I've tested it with short excerpts from the original dataset and and the output is what I need. Re your concerns: we're pretty well certain that the only variance in the fields will be the number of potential "Image" elements/cells, which is at the end of the row, so things should work out there. It should give us the necessary structure. Now I just need to find a tool to transform the entire 700mb file and I hope to be off to the races... – mitchellirons Feb 24 '17 at 18:18
  • It's hopefully a one-off experience.. Relying on the kindness of strangers until our coder gets back from vacation. – mitchellirons Feb 24 '17 at 18:30