I have multiple xml files and I would like to fetch values from them, and write them in a separate csv/text/excel file row by row.

I tried the grep command below:

grep -e \<r p\>  Inputfilename | sed 's/<[^>]*>//g' | awk '{ print $2 }' | awk '{ for (i=1;i<=NF;i++ ) printf $i " " }' >> Output.txt 

But this command writes all the values on a single line. I am a newbie, so I am not sure how to put each file's values on its own row.

Here is a sample input file:

<measType p="1">Used NonHeap Mem MB</measType>
            <measType p="2">Online CPU Usage %</measType>
            <measType p="3">Used Physical Mem %</measType>
            <measType p="4">Used Physical Mem MB</measType>
            <measType p="5">Used Heap Mem %</measType>
            <measType p="6">Used Tenured Gen MB</measType>
            <measType p="7">Used Survivor Space MB</measType>
            <measType p="8">Used NonHeap Mem %</measType>
            <measType p="9">Total CPU Usage %</measType>
            <measType p="10">Used Eden Space MB</measType>
            <measType p="11">Used Heap Mem MB</measType>
            <measValue measObjLdn="">
                <r p="1">48.361183166503906</r>
                <r p="2">0.008397036232054234</r>
                <r p="3">4.5677</r>
                <r p="4">34425.0</r>
                <r p="5">68.05066879841843</r>
                <r p="6">410.58392333984375</r>
                <r p="7">22.375</r>
                <r p="8">93.67783664213832</r>
                <r p="9">0.028054807427357</r>
                <r p="10">169.9580841064453</r>
                <r p="11">602.8837356567383</r>
            </measValue>

The output I got from the above command, for this input, is:

48.361183166503906 0.008397036232054234 4.5677 34425.0 68.05066879841843 410.58392333984375 22.375 93.67783664213832 0.028054807427357 169.9580841064453 602.883735656738

When I run this command for multiple files it yields something like this:

48.361183166503906 0.008397036232054234 4.5677 34425.0 68.05066879841843 410.58392333984375 22.375 93.67783664213832 0.028054807427357 169.9580841064453 602.8837356567383  48.377540588378906 0.008116667158901691 5.73992 33834.0 10.798112742450364 42.10478973388672 22.375 93.70952172083081 0.021666161122907 31.18431854248047 95.66410827636719  58.068382263183594 3.406280755996704 6.46515 34405.0 56.60833858273274 903.4959945678711 16.5166015625 94.90236120642875 7.068469741716277 39.66230773925781 959.4206771850586

But I want the command result to be:

48.361183166503906 0.008397036232054234 4.5677 34425.0 68.05066879841843 410.58392333984375 22.375 93.67783664213832 0.028054807427357 169.9580841064453 602.8837356567383  
48.377540588378906 0.008116667158901691 5.73992 33834.0 10.798112742450364 42.10478973388672 22.375 93.70952172083081 0.021666161122907 31.18431854248047 95.66410827636719  
58.068382263183594 3.406280755996704 6.46515 34405.0 56.60833858273274 903.4959945678711 16.5166015625 94.90236120642875 7.068469741716277 39.66230773925781 959.4206771850586  

Please assist me. Thanks in advance!

Daniel Haley
  • Well written question all in all (edited it a little). Good job! – Rann Lifshitz Apr 16 '18 at 11:00
  • As you mention XML files, it is important to realize that tools like `sed`, `grep`, `awk` and the like are not the right tools for the job. They are essentially sledgehammers, and for XML you need a more delicate tool. I strongly advise you to have a look at `xmlstarlet` and `xpath`. Imagine that an `<r>` element starts on a new line (still valid XML), or that the `<r>` elements appear in a different order. For cases like these you want an XML toolbox. – kvantour Apr 16 '18 at 15:35
  • Do not use [regex to parse x|html](https://stackoverflow.com/a/1732454/1422451). – Parfait Apr 16 '18 at 18:17
  • Did either of the answers solve your problem, or are you still having issues? – Daniel Haley Apr 18 '18 at 20:29

2 Answers

One option is to use xmlstarlet's tr command with an XSLT stylesheet.

Example...

XSLT 1.0 (example.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/*">
    <xsl:for-each select=".//r">
      <xsl:sort select="@p" data-type="number"/>
      <xsl:if test="position() > 1">
        <xsl:text> </xsl:text>
      </xsl:if>
      <xsl:value-of select="normalize-space()"/>
    </xsl:for-each>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>

xmlstarlet command line

xml tr example.xsl *.xml

output (using two input files; the one you supplied and a copy with "b" added to each r value)

48.361183166503906 0.008397036232054234 4.5677 34425.0 68.05066879841843 410.58392333984375 22.375 93.67783664213832 0.028054807427357 169.9580841064453 602.8837356567383
48.361183166503906b 0.008397036232054234b 4.5677b 34425.0b 68.05066879841843b 410.58392333984375b 22.375b 93.67783664213832b 0.028054807427357b 169.9580841064453b 602.8837356567383b

You could also get something very similar (currently I'm getting an extra newline in the beginning of the output) with xmlstarlet's sel command:

xml sel -T -t -n -m "//r" -s A:N:T "@p" -v "normalize-space()" -o " " *.xml
Daniel Haley

It's probably writing everything on a single line because of the printf in your awk command. printf does not add a newline by default. Try using print, or add "\n" explicitly.
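As a rough sketch of the explicit-newline idea: this assumes each `<r>` element sits on its own line, as in the sample, so the value is field `$1` once the tags are stripped. The function name `values_row` is made up for illustration. Processing one file at a time and ending each file's row with a newline could look like this:

```shell
# values_row FILE: print all <r> values of FILE on one space-separated line.
# Assumes one <r> element per line, as in the sample input.
values_row() {
  grep '<r p=' "$1" |
    sed 's/<[^>]*>//g' |
    awk '{ printf "%s%s", (NR > 1 ? " " : ""), $1 }   # values separated by spaces
         END { if (NR) print "" }'                    # finish the row with a newline
}

# One row per input file; files that do not exist are skipped.
for f in *.xml; do
  [ -e "$f" ] || continue
  values_row "$f"
done
```

Redirect the loop's output (for example `done >> Output.txt`) to collect the rows in a file.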

Alternatively, if your measValue element always contains 11 nodes, consider using:

$ grep -e \<r p\>  Inputfilename | sed 's/<[^>]*>//g' | awk '{print $2}' | paste - - - - - - - - - - -
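To illustrate how `paste` groups lines, here is a small sketch. The helper name `group_rows` is hypothetical, and `-d ' '` is added on the assumption that space-separated output is wanted (`paste` uses tabs by default). Each `-` argument reads the next line from standard input, so N dashes produce N values per row:

```shell
# group_rows N: read one value per line on stdin, print N values per row.
group_rows() {
  n=$1
  set --                      # clear the positional parameters
  i=0
  while [ "$i" -lt "$n" ]; do
    set -- "$@" -             # add one "-" (stdin) per output column
    i=$((i + 1))
  done
  paste -d ' ' "$@"           # join the columns with spaces instead of tabs
}

# Six input lines become two rows of three values each.
printf '%s\n' a b c d e f | group_rows 3
```

With the sample data, piping the extracted values into `group_rows 11` would correspond to the eleven dashes in the command above.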
Gautam