New to Stackoverflow and asking a question about XML to CSV. I'm a data manager with a background in SPSS, so the XML isn't always my strongsuit. I'm trying to convert a dataset exported from a hierarchical database and stored in XML into a CSV format for a number of reasons. The original DB's structure isn't great, which is causing problems with my XSLT.
Here is the XML I have to work with. It's a 700mb file:
<ABC_Data> <UID>1</UID> <DocumentNumber>000000001</DocumentNumber> <Surname>Smith</Surname> <GivenName>John</GivenName> <BirthDateList> <BirthDate>19/06/19888</BirthDate> </BirthDateList> <StationNumberList> <StationNumber>2009981</StationNumber> </StationNumberList> <Reference> <ReferenceEn>RG 150, Volume 01 - 1</ReferenceEn> <ReferenceFr>RG 150, Volume 01 - 1</ReferenceFr> </Reference> <DigitizeList> <Image>http://data.foo.bar.com/733a.gif</Image> <Image>http://data2.for.bar.com/733b.gif</Image> </DigitizeList> <UID>2</UID> <DocumentNumber>000000002</DocumentNumber> <Surname>Kootz</Surname> <GivenName>Ernst</GivenName> <BirthDateList> <BirthDate>24/12/1984</BirthDate> </BirthDateList> <StationNumberList> <StationNumber>2000023</StationNumber> </StationNumberList> <Reference> <ReferenceEn>RG 150, Volume 01 - 1</ReferenceEn> <ReferenceFr>RG 150, Volume 01 - 1</ReferenceFr> </Reference> <DigitizeList> <Image>http://data.foo.bar.com/744a.gif</Image> <Image>http://data2.for.bar.com/755b.gif</Image> </DigitizeList> </ABC_Data>
Here is the basic XSLT I'm working from (from this thread) to convert this into a CSV format. What's happening is that the records are not nested properly, so I can't get an output that differentiates one record to another in the file. Also, the multiple
<Image>
fields are being collected together on the output without intervening separators, i.e., they're turning into 1 field into 2 or 3 or 4 distinct fields, as the number of<Image>
s might be in the text [Edit: Now solved].
Here is the XSLT:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="iso-8859-1"/>
<xsl:strip-space elements="*" />
<xsl:template match="/*/child::*">
<xsl:for-each select="child::*">
<xsl:if test="position() != last()"><xsl:value-of select="normalize-space(.)"/>;</xsl:if>
<xsl:if test="position() = last()"><xsl:value-of select="normalize-space(.)"/>;</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
- Here is the output model I'd like to get to. It answers the need for differentiated records as well as distinguishing the multiple "Image" fields with similar names:
1;0000000001;Smith;John;19/06/19888;2009981;RG 150, Volume 01 - 1;RG 150, Volume 01 - 1;>http://data.foo.bar.com/733a.gif;http://data2.for.bar.com/733b.gif
2;0000000002;Koontz;Ernst;24/12/1984;2000023;RG 150, Volume 01 - 1;RG 150, Volume 01 - 1;http://data.foo.bar.com/744a.gif;http://data2.for.bar.com/755b.gif
Can anyone suggest a way forward? I'd like to clean this up so that
- all content in separate Image fields are given a semicolon between them on the output. [Edit: Solved, thank you hivemind!]
- I can differentiate between Record 1 and Record 2, Record 3, etc.
My XSLT knowledge is nearly 10 years old, so I could use the support of the community for help on this.
Thanks.