
I have the following XML to parse, and I want to extract the value of the name tag based on the value of the type tag: extract it only if type == 'hosted'. I would like to do this with shell tools such as grep, sed and awk. I have extracted a single tag value unconditionally before, but never with a condition. I could easily do it in Python or any other programming language I know, but it would be ideal to do it in a shell script.

...
    <repositories-item>
      <name>hosted-npm</name>
      <type>hosted</type>
    </repositories-item>
    <repositories-item>
      <name>proxied-npm</name>
      <type>proxied</type>
    </repositories-item>
...
Krish
  • I urge you to use XML tools to process XML (in Python, etc., too). In particular, you might consider writing an appropriate XSLT stylesheet, and using a command such as `xsltproc` from your script to apply the stylesheet to your XML input. – John Bollinger Apr 12 '17 at 16:05
  • Check out [xmlstarlet](http://xmlstar.sourceforge.net/) – Dima Chubarov Apr 12 '17 at 16:05

2 Answers

xmlstarlet is a command line XML Toolkit that can express complex XSLT templates as a short sequence of command line switches.

Suppose we are provided with a well-formed XML document repos.xml

<repositories>
  <repositories-item>
      <name>hosted-npm</name>
      <type>hosted</type>
    </repositories-item>
    <repositories-item>
      <name>proxied-npm</name>
      <type>proxied</type>
    </repositories-item>
</repositories>

If you run it through an XMLStarlet filter with the following switches

$ cat repos.xml | xmlstarlet sel -t -m '//repositories-item' \
                 -i 'type="hosted"' -v 'name' -n 

You will get one line of output

hosted-npm

Let's look at the XMLStarlet command line.

  1. We run the command in select mode, specified with the sel switch.
  2. We start the selection template with the -t switch.
  3. We restrict matching to <repositories-item> elements with the //repositories-item XPath expression, specified with the -m switch.
  4. We keep only those elements whose type child element has the value "hosted", specified with the -i switch.
  5. We print the value of the name element, specified with the -v switch.
  6. After each value we print a newline, specified with the -n switch.
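
The match in step 3 and the condition in step 4 can also be folded into a single XPath predicate. The following is a minimal sketch, assuming xmlstarlet is installed under that name and that the sample document is saved as repos.xml (the heredoc recreates it so the snippet is self-contained). The variable name hosted_names is only illustrative.

```shell
# Recreate the sample document (assumed file name: repos.xml).
cat > repos.xml <<'EOF'
<repositories>
  <repositories-item>
    <name>hosted-npm</name>
    <type>hosted</type>
  </repositories-item>
  <repositories-item>
    <name>proxied-npm</name>
    <type>proxied</type>
  </repositories-item>
</repositories>
EOF

# Select the <name> of every <repositories-item> whose <type> is "hosted".
# The predicate [type="hosted"] replaces the separate -m / -i pair.
hosted_names=$(xmlstarlet sel -t -v '//repositories-item[type="hosted"]/name' -n repos.xml)
printf '%s\n' "$hosted_names"
```

Capturing the result in a variable makes it easy to reuse later in the same script.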

Here is the equivalent XSLT generated by XMLStarlet

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:exslt="http://exslt.org/common" version="1.0" extension-element-prefixes="exslt">
  <xsl:output omit-xml-declaration="yes" indent="no"/>
  <xsl:template match="/">
    <xsl:for-each select="//repositories-item">
      <xsl:choose>
        <xsl:when test="type=&quot;hosted&quot;">
          <xsl:call-template name="value-of-template">
            <xsl:with-param name="select" select="name"/>
          </xsl:call-template>
          <xsl:value-of select="'&#10;'"/>
        </xsl:when>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
  <xsl:template name="value-of-template">
    <xsl:param name="select"/>
    <xsl:value-of select="$select"/>
    <xsl:for-each select="exslt:node-set($select)[position()&gt;1]">
      <xsl:value-of select="'&#10;'"/>
      <xsl:value-of select="."/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Per Charles Duffy's suggestion, it is worth noting that this XSLT stylesheet can be generated with XMLStarlet using the -C option:

xmlstarlet sel -C -t -m '//repositories-item' \
       -i 'type="hosted"' -v 'name' -n > hosted-repos.xslt

This generated XSLT specification can be directly used with xsltproc as

cat repos.xml | xsltproc hosted-repos.xslt - 
Dima Chubarov
  • It might be worth showing the OP how to use `xsltproc` to apply that XSLT, just to be sure they understand they can use this solution on servers without XMLStarlet installed. – Charles Duffy Apr 12 '17 at 16:58

Lacking XML-specific tools, awk can do the job, using the enclosing tags as record delimiters:

$ awk -v RS='</?repositories-item>' '/<type>hosted<\/type>/' file

  <name>hosted-npm</name>
  <type>hosted</type>

Note that this requires a multi-character (regular expression) RS, which GNU awk supports; POSIX does not guarantee it for other awk implementations.
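
If a multi-character RS is not available, a line-oriented variant is possible. This is a rough sketch rather than a general XML parser: it assumes each <name> and <type> element sits on its own line, as in the sample, and the file name repos.xml and the variable names are illustrative.

```shell
# Recreate the sample document (assumed file name: repos.xml).
cat > repos.xml <<'EOF'
<repositories>
  <repositories-item>
    <name>hosted-npm</name>
    <type>hosted</type>
  </repositories-item>
  <repositories-item>
    <name>proxied-npm</name>
    <type>proxied</type>
  </repositories-item>
</repositories>
EOF

# Default RS (newline); fields are split on < and >, so for a line such as
# "<name>hosted-npm</name>" the element text is in $3.
hosted_names=$(awk -F'[<>]' '
    /<repositories-item>/   { item_name = ""; item_type = "" }  # opening tag: reset state
    /<name>/                { item_name = $3 }
    /<type>/                { item_type = $3 }
    /<\/repositories-item>/ { if (item_type == "hosted") print item_name }
' repos.xml)
printf '%s\n' "$hosted_names"
```

The state is reset on every opening tag so that a missing <type> in one item cannot inherit the value from the previous item.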

You can get finer control over the match and the output by parsing each record into an array keyed by tag name:

$ awk -v RS='</?repositories-item>' -F'[<>]' '
    {delete a; 
     for(i=2;i<=NF;i+=4) a[$i]=$(i+1); 
     if(a["type"]=="hosted") print a["name"] }' file


hosted-npm
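
Since the goal is to use this from a shell script, the command above can be wrapped in a function that takes the wanted type as a parameter. This is only a sketch: names_of_type is a hypothetical helper name, repos.xml is an assumed file name, and the function still relies on an awk with regular expression RS support.

```shell
# Recreate the sample document (assumed file name: repos.xml).
cat > repos.xml <<'EOF'
<repositories>
  <repositories-item>
    <name>hosted-npm</name>
    <type>hosted</type>
  </repositories-item>
  <repositories-item>
    <name>proxied-npm</name>
    <type>proxied</type>
  </repositories-item>
</repositories>
EOF

# names_of_type TYPE FILE (hypothetical helper):
# print the <name> of every item whose <type> equals TYPE.
names_of_type() {
    awk -v want="$1" -v RS='</?repositories-item>' -F'[<>]' '
        { delete a                                        # forget the previous record
          for (i = 2; i <= NF; i += 4) a[$i] = $(i + 1)   # tag name -> element text
          if (a["type"] == want) print a["name"] }' "$2"
}

# Consume the result line by line.
names_of_type hosted repos.xml | while IFS= read -r repo; do
    printf 'hosted repository: %s\n' "$repo"
done
```

Passing the type with -v keeps the awk program itself free of shell quoting, so the same function works for proxied or any other type value.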
karakfa