Here is a sample file; we need to convert its values into a delimiter-separated file:

test.xml

<?xml version="1.0" encoding="UTF-8" ?>
 <testjar>

 <testable>
 <trigger>Trigger1</trigger>
 <message>2012-06-14T00:03.54</message>
 <sales-info>
 <san-a>no</san-a>
 <san-b>no</san-b>
 <san-c>no</san-c>
 </sales-info>
 </testable>


  <testable>
  <trigger>Trigger2</trigger>
  <message>2012-06-15T00:03.54</message>
  <sales-info>
  <san-a>yes</san-a>
  <san-b>yes</san-b>
  <san-c>no</san-c>
  </sales-info>
 </testable>

 </testjar>

Each record should start on a new line. The resulting file, sample.txt, should look something like this:

Trigger1|2012-06-14T00:03.54|no|no|no  
Trigger2|2012-06-15T00:03.54|yes|yes|no

Note: xmlstarlet is not installed on my server. Is it possible to do this without xmlstarlet?

Pravin Satav

3 Answers

If you have xmlstarlet installed, you can try:

me@home$ xmlstarlet sel -t -m "//testable" -v trigger -o "|" -v message -o "|" -m sales-info -v san-a -o "|" -v san-b -o "|" -v san-c -n test.xml
Trigger1|2012-06-14T00:03.54|no|no|no
Trigger2|2012-06-15T00:03.54|yes|yes|no

Breakdown of the command:

xmlstarlet sel -t 
    -m "//testable"       # match <testable>
      -v trigger -o "|"     # print out value of <trigger> followed by |
      -v message -o "|"     # print out value of <message> followed by | 
      -m sales-info         # match <sales-info>
        -v san-a -o "|"       # print out value of <san-a> followed by |
        -v san-b -o "|"       # print out value of <san-b> followed by | 
        -v san-c              # print out value of <san-c>
    -n                   # print new line
    test.xml             # INPUT XML FILE

To handle tags that vary within <testable>, you can try the following, which returns the text of all leaf nodes:

me@home$ xmlstarlet sel -t -m "//testable" -m "descendant::*[not(*)]" -v 'text()' -i 'not(position()=last())' -o '|' -b -b -n test.xml
Trigger1|2012-06-14T00:03.54|no|no|no
Trigger2|2012-06-15T00:03.54|yes|yes|no

Breakdown of the command:

xmlstarlet sel -t 
    -m "//testable"                         # match <testable>
      -m "descendant::*[not(*)]"              # match all leaf nodes
        -v 'text()'                             # print text
        -i 'not(position()=last())' -o '|'      # print | if not last item
        -b -b                                   # break out of nested matches
    -n                                      # print new line
    test.xml                                # INPUT XML FILE

If you do not have access to xmlstarlet, check which other tools you have at your disposal. Other options include xsltproc (see mzjn's answer) and xpath.

If those tools are not available, I would suggest using a higher level language (Python, Perl) which gives you access to a proper XML library.
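As a rough illustration of the higher-level-language route, here is a minimal Python sketch that mirrors the leaf-node approach above. It assumes a Python 3 interpreter with the standard-library xml.etree.ElementTree module is available on the server, which the question does not confirm. The function name records_to_lines and its parameters are made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the question (test.xml), inlined so the sketch is self-contained.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8" ?>
<testjar>
  <testable>
    <trigger>Trigger1</trigger>
    <message>2012-06-14T00:03.54</message>
    <sales-info><san-a>no</san-a><san-b>no</san-b><san-c>no</san-c></sales-info>
  </testable>
  <testable>
    <trigger>Trigger2</trigger>
    <message>2012-06-15T00:03.54</message>
    <sales-info><san-a>yes</san-a><san-b>yes</san-b><san-c>no</san-c></sales-info>
  </testable>
</testjar>
"""


def records_to_lines(xml_data, record_tag="testable", sep="|"):
    """Return one delimited line per <record_tag> element (hypothetical helper)."""
    root = ET.fromstring(xml_data)
    lines = []
    for record in root.iter(record_tag):
        # A leaf is an element without child elements; iter() walks in document order.
        leaves = [el for el in record.iter() if len(el) == 0]
        lines.append(sep.join((el.text or "").strip() for el in leaves))
    return lines


if __name__ == "__main__":
    for line in records_to_lines(SAMPLE):
        print(line)
```

Because the fields are discovered from the document rather than listed by name, tags can be added or removed inside <testable> without changing the script. To read test.xml instead of the inlined sample, pass the file's bytes to records_to_lines.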

While it is possible to parse the file manually using regular expressions, such a solution is not ideal, especially with inconsistent input. For example, the following (assuming you have gawk and sed) takes your input and should produce the expected output:

me@home$ gawk 'match($0, />(.*)</, a){printf("%s|",a[1])} /<\/testable>/{print ""}' test.xml | sed 's/.$//'
Trigger1|2012-06-14T00:03.54|no|no|no
Trigger2|2012-06-15T00:03.54|yes|yes|no

However, this would fail miserably if the input format changes, and it is therefore not a solution I would generally recommend.
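To make the fragility concrete, the following Python sketch feeds the same kind of data, written on a single line, to a line-based regex and to a real parser. The regex imitates the greedy `>(.*)<` match used in the gawk command. The helper names regex_fields and parsed_fields are invented for this illustration, and Python 3 is assumed.

```python
import re
import xml.etree.ElementTree as ET

# The same kind of record as in test.xml, but with every tag on a single line.
ONE_LINE = (
    "<testjar><testable><trigger>Trigger1</trigger>"
    "<message>2012-06-14T00:03.54</message></testable></testjar>"
)


def regex_fields(text):
    """Line-based approach: capture everything between the first '>' and the last '<' per line."""
    matches = (re.search(r">(.*)<", line) for line in text.splitlines())
    return [m.group(1) for m in matches if m]


def parsed_fields(text):
    """Parser-based approach: collect the text of every leaf element."""
    root = ET.fromstring(text)
    return [(el.text or "").strip() for el in root.iter() if len(el) == 0]


if __name__ == "__main__":
    print(regex_fields(ONE_LINE))   # a single "field" that still contains markup
    print(parsed_fields(ONE_LINE))  # the two values, independent of line layout
```

The regex version only works while each value sits alone on its own line; the parser version does not depend on how the file is laid out.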

Shawn Chin
  • The catch is that my file will keep changing (the XML tags will increase or decrease). Is there a command that can take care of this? – Pravin Satav Jul 26 '12 at 09:04
  • Do you mean the tags within `<testable>` are always different? – Shawn Chin Jul 26 '12 at 09:15
  • Yeah, but we can store the tags in another file and fetch that info here; we can manage that. The big issue is that unfortunately I don't have xmlstarlet on my server :-( Is this possible without xmlstarlet? – Pravin Satav Jul 26 '12 at 09:18
  • What OS are you using? And are you allowed to install additional tools from the default package manager? (I'm trying to avoid a regex/text-parsing approach here since it can be unreliable especially if the input format is not consistent) – Shawn Chin Jul 26 '12 at 09:22
  • Linux version 2.6.9-100.ELsmp. I'm not allowed to install additional tools. Thanks – Pravin Satav Jul 26 '12 at 09:30
  • @PravinSatav I'm assuming that's Red Hat? I'm afraid it's difficult to provide an ideal solution without knowing what tools you have and don't have. – Shawn Chin Jul 26 '12 at 09:56
  • @PravinSatav See the updated answer for an example that uses gawk and sed (note the disclaimer in the text). – Shawn Chin Jul 26 '12 at 14:55

Here is an XSLT stylesheet that does what you want (saved in test.xsl):

<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

 <xsl:template match="testable">
   <xsl:value-of select='trigger'/><xsl:text>|</xsl:text>
   <xsl:value-of select='message'/><xsl:text>|</xsl:text>
   <xsl:value-of select='sales-info/san-a'/><xsl:text>|</xsl:text>
   <xsl:value-of select='sales-info/san-b'/><xsl:text>|</xsl:text>
   <xsl:value-of select='sales-info/san-c'/><xsl:text>&#xA;</xsl:text>
 </xsl:template>

</xsl:stylesheet>

Command (here I am assuming that you have libxml2 and libxslt installed; xsltproc is a command line tool that uses these libraries):

xsltproc -o sample.txt test.xsl test.xml

Contents of sample.txt:

Trigger1|2012-06-14T00:03.54|no|no|no
Trigger2|2012-06-15T00:03.54|yes|yes|no
mzjn

Here's a solution that uses only grep, sed and a bash loop:

grep -E '<trigger>|<message>|<san-.>' test.xml | sed -e 's/<[^>]*>//g' | while read -r line; do [ $((++i % 5)) -ne 0 ] && printf '%s|' "$line" || printf '%s\n' "$line"; done

However, it only works on a file formatted as in your sample (each element on a separate line, five fields per record). It is not nearly as flexible or reliable as the other answers, which involve proper XML parsing or transformation.

It can be enhanced to some extent though...

Costi Ciudatu