-2

I have an XML file as follows:

<Module dataPath="/abc/def/xyz" handler="DataRegistry" id="id1" path="test.so"/>
<Module id="id2" path="/my/file/path">
  <Config>
    <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/xyz" />
    <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/xyz" id="V2"/>
  </Config>
</Module>

I just want to extract the value of `dataPath` for every module id.

I was using a command like:

`grep 'id2' file | grep -ioPm1 "(?<=DataPath=)[^ ]+"`

This gives me the value for the first module id but not for the second, because the second module spans multiple lines.

How can I do this using a shell script?

The desired output would be: if I want the dataPath of the id1 module, then I should get

/my/file/path

Or, for the second module id, say id2, I should get the dataPath values separated by commas:

/my/file/path, /my/file/path

Alternatively, my second approach would be to remove the newline characters between <Module and </Module> only, so that I can then use grep.

ggupta
  • 1
    [You can't parse \[X\]HTML with regex](http://stackoverflow.com/a/1732454/3776858) – Cyrus May 30 '19 at 07:37
  • 2
    Possible duplicate of [How to parse XML in Bash?](https://stackoverflow.com/q/893585/608639), [How to parse XML using shellscript?](https://stackoverflow.com/q/4680143/608639), etc – jww May 30 '19 at 07:49
  • @jww, no it's not. I have edited the question to make it clearer – ggupta May 30 '19 at 08:02
  • Your desired output shows two lines. The 1st line has the value of one `dataPath` attribute, and the 2nd line has the values of two `dataPath` attributes, separated by a comma. Presumably that structure indicates a pattern relating to your source XML? Does each newline equate to a `<Module>` element node? I.e. the 1st line indicates there is a `<Module>` with one associated `dataPath` attribute, and the 2nd line indicates there are two `dataPath` attributes associated with another `<Module>`. Presumably a list of all matching values, one value per line, is not what you want. Please clarify in OP. – RobC May 30 '19 at 09:41
  • 1
    @RobC, I have a configuration file that holds the configuration of each `module` (identified by id). Some `modules` have their complete information on one line, and some have it spread over multiple lines. I just need to extract one attribute, `datapath`, for every `module`. It is possible that one module id has multiple `datapath` values – ggupta May 30 '19 at 09:51

2 Answers

2

-m1 tells grep to exit after the first matching line; that's why it prints only one line of output.
I wouldn't use a line-oriented tool for this, though. There are more convenient tools for parsing XML, such as XMLStarlet:

xml sel -t -m '//@dataPath' -v . -n file.xml
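
If XMLStarlet is not installed, roughly the same flat listing can be sketched with Python's standard library. This is only an illustration and not part of the original answer: the function name `all_data_paths` and the `sample` string are hypothetical, and the sketch assumes the input is well-formed XML with a single root element.

```python
import xml.etree.ElementTree as ET

def all_data_paths(xml_text):
    """Return every dataPath attribute value, in document order."""
    root = ET.fromstring(xml_text)
    # iter() visits the root element and all of its descendants.
    return [el.attrib["dataPath"] for el in root.iter() if "dataPath" in el.attrib]

# Hypothetical sample input, wrapped in a root element so it is well formed.
sample = """<root>
  <Module dataPath="/abc/def/1" handler="DataRegistry" id="id1" path="test.so"/>
  <Module id="id2" path="/my/file/path">
    <Config>
      <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/2"/>
      <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/3" id="V2"/>
    </Config>
  </Module>
</root>"""

# One value per line, like the xml sel command above.
print("\n".join(all_data_paths(sample)))
```

Like the `xml sel` command, this prints one value per line and does not group values by module.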
oguz ismail
  • 1
    or `xml sel -t -v '//@dataPath' -n file.xml` – Cyrus May 30 '19 at 07:35
  • @Cyrus that'd print a blank line if there were no elements with `dataPath` attribute. – oguz ismail May 30 '19 at 07:38
  • 1
    In this case, yes. You can also omit `-n` if you don't need the final line break. – Cyrus May 30 '19 at 07:49
  • This prints each matching attribute value on a separate line. However, I'm pretty sure the OP wants to group values per line, i.e. print all `dataPath` attribute values associated with a `Module` element node (including those associated with descendant nodes) and delimit them with a comma. A newline (to my understanding) should act as a delimiter for each `Module` element node. – RobC May 30 '19 at 14:13
1

Firstly, my answer assumes that you have actual well-formed source XML. The example code you've provided doesn't have a root element, but I'll assume there is one anyway.

Bash features by themselves are not very well suited to parsing XML.

This renowned Bash FAQ states the following:

Do not attempt [to extract data from an XML file] with `sed`, `awk`, `grep`, and so on (it leads to undesired results)

If you must use a shell script, then use an XML-specific command-line tool such as XMLStarlet or xsltproc. Refer to the XMLStarlet download information if you don't have it installed already.


Solution:

  1. Given your source XML and your desired output, consider using the following XSLT template.

    template.xsl

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="text"/>
    
      <xsl:template match="node()|@*">
        <xsl:apply-templates select="node()|@*"/>
      </xsl:template>
    
      <xsl:template match="Module">
        <xsl:choose>
    
          <xsl:when test="@dataPath and not(descendant::*/@dataPath)">
            <xsl:value-of select="@dataPath"/>
            <xsl:text>&#xa;</xsl:text>
          </xsl:when>
    
          <xsl:when test="not(@dataPath) and descendant::*/@dataPath">
            <xsl:for-each select="descendant::*/@dataPath">
              <xsl:value-of select="."/>
              <xsl:if test="position()!=last()">
                <xsl:text>, </xsl:text>
              </xsl:if>
            </xsl:for-each>
            <xsl:text>&#xa;</xsl:text>
          </xsl:when>
    
          <xsl:when test="@dataPath and descendant::*/@dataPath">
            <xsl:value-of select="@dataPath"/>
            <xsl:text>, </xsl:text>
            <xsl:for-each select="descendant::*/@dataPath">
              <xsl:value-of select="."/>
              <xsl:if test="position()!=last()">
                <xsl:text>, </xsl:text>
              </xsl:if>
            </xsl:for-each>
            <xsl:text>&#xa;</xsl:text>
          </xsl:when>
    
        </xsl:choose>
      </xsl:template>
    
    </xsl:stylesheet>
    
  2. Then run either:

    • the following XMLStarlet command:

      $ xml tr /path/to/template.xsl /path/to/input.xml
      
    • Or the following xsltproc command:

      $ xsltproc /path/to/template.xsl /path/to/input.xml
      

    Note: The pathnames to template.xsl and input.xml in the commands above should be changed to wherever those files reside.

    Either command transforms your input.xml file and prints the desired results.
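
For readers who find the three `xsl:when` branches hard to follow, the grouping rule they implement can be sketched in a few lines of Python. This is a hedged illustration and not a replacement for the template: the function name `grouped_data_paths` and the `sample` string are hypothetical, and the sketch assumes `Module` elements are not nested inside one another.

```python
import xml.etree.ElementTree as ET

def grouped_data_paths(xml_text):
    """Return one comma-separated line of dataPath values per Module."""
    root = ET.fromstring(xml_text)
    lines = []
    for module in root.iter("Module"):
        # module.iter() yields the Module itself first, then its descendants,
        # so the Module's own dataPath (if any) comes before the nested ones.
        paths = [el.attrib["dataPath"] for el in module.iter() if "dataPath" in el.attrib]
        if paths:  # Modules without any dataPath produce no line
            lines.append(", ".join(paths))
    return lines

# Hypothetical sample covering the three cases handled by the template,
# plus a Module with no dataPath at all.
sample = """<root>
  <Module dataPath="/abc/def/1" id="id1"/>
  <Module id="id2">
    <Config>
      <Source dataPath="/abc/def/2"/>
      <Source dataPath="/abc/def/3" id="V2"/>
    </Config>
  </Module>
  <Module id="id3" dataPath="/abc/def/4">
    <Config>
      <Source dataPath="/abc/def/5"/>
    </Config>
  </Module>
  <Module id="id7"/>
</root>"""

for line in grouped_data_paths(sample):
    print(line)
```

Collecting the values into a list and joining them once avoids the need for a separate branch per case, which is the main difference from the XSLT version.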


Demo:

  1. Using the following input.xml file:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
      <Module dataPath="/abc/def/1" handler="DataRegistry" id="id1" path="test.so"/>
    
      <Module id="id2" path="/my/file/path">
        <Config>
          <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/2" />
          <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/3" id="V2"/>
        </Config>
      </Module>
    
      <Module id="id3" path="/my/file/path" dataPath="/abc/def/4">
        <Config>
          <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/5" />
          <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/6" id="V2"/>
        </Config>
      </Module>
    
      <Module id="id4" path="/my/file/path" dataPath="/abc/def/7"/>
      <Module id="id5" path="/my/file/path" dataPath="/abc/def/8"/>
    
    
      <!-- The following <Module>'s have no associated `dataPath` attribute -->
      <Module id="id6">
        <Config>
          <Source cutoffpackage="1" dailyStart="20060819" id="V2"/>
        </Config>
      </Module>
    
      <Module id="id7"/>
    </root>
    
  2. Then running either of the aforementioned commands prints the following result:

    /abc/def/1
    /abc/def/2, /abc/def/3
    /abc/def/4, /abc/def/5, /abc/def/6
    /abc/def/7
    /abc/def/8
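
The question also asks for the paths of one particular module id, which the template does not address because it prints a line for every module. One possible way to do that lookup, sketched in Python under stated assumptions: the function name `paths_by_module_id` and the shortened `demo` string are hypothetical, `Module` elements are assumed not to be nested, and each `id` is assumed to be unique.

```python
import xml.etree.ElementTree as ET

def paths_by_module_id(xml_text):
    """Map each Module id to the list of dataPath values found in that Module."""
    root = ET.fromstring(xml_text)
    result = {}
    for module in root.iter("Module"):
        module_id = module.get("id")
        if module_id is None:
            continue  # skip Modules that carry no id attribute
        # The Module's own dataPath (if present) comes first, then nested ones.
        result[module_id] = [
            el.attrib["dataPath"] for el in module.iter() if "dataPath" in el.attrib
        ]
    return result

# Shortened, hypothetical version of the demo input above.
demo = """<root>
  <Module dataPath="/abc/def/1" handler="DataRegistry" id="id1" path="test.so"/>
  <Module id="id2" path="/my/file/path">
    <Config>
      <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/2"/>
      <Source cutoffpackage="1" dailyStart="20060819" dataPath="/abc/def/3" id="V2"/>
    </Config>
  </Module>
  <Module id="id7"/>
</root>"""

lookup = paths_by_module_id(demo)
print(", ".join(lookup["id2"]))
```

A module without any `dataPath` maps to an empty list, so the caller can decide whether to print nothing or report an error.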
    

Additional Note:

If you want to avoid a separate .xsl file, you can inline the XSLT template above in your shell script as follows:

script.sh

#!/usr/bin/env bash

xslt() {
cat <<'EOX'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>

  <xsl:template match="node()|@*">
    <xsl:apply-templates select="node()|@*"/>
  </xsl:template>

  <xsl:template match="Module">
    <xsl:choose>

      <xsl:when test="@dataPath and not(descendant::*/@dataPath)">
        <xsl:value-of select="@dataPath"/>
        <xsl:text>&#xa;</xsl:text>
      </xsl:when>

      <xsl:when test="not(@dataPath) and descendant::*/@dataPath">
        <xsl:for-each select="descendant::*/@dataPath">
          <xsl:value-of select="."/>
          <xsl:if test="position()!=last()">
            <xsl:text>, </xsl:text>
          </xsl:if>
        </xsl:for-each>
        <xsl:text>&#xa;</xsl:text>
      </xsl:when>

      <xsl:when test="@dataPath and descendant::*/@dataPath">
        <xsl:value-of select="@dataPath"/>
        <xsl:text>, </xsl:text>
        <xsl:for-each select="descendant::*/@dataPath">
          <xsl:value-of select="."/>
          <xsl:if test="position()!=last()">
            <xsl:text>, </xsl:text>
          </xsl:if>
        </xsl:for-each>
        <xsl:text>&#xa;</xsl:text>
      </xsl:when>

    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
EOX
}

# 1. Using XMLStarlet
xml tr <(xslt) /path/to/input.xml

# 2. Or using xsltproc
xsltproc <(xslt) - </path/to/input.xml

Note: The pathname to your input.xml (i.e. the /path/to/input.xml part in script.sh above) should again be changed to wherever that file resides.

RobC