
Ideally, what I would like to be able to do is:

cat xhtmlfile.xhtml |
getElementViaXPath --path='/html/head/title' |
sed -e 's%(^<title>|</title>$)%%g' > titleOfXHTMLPage.txt
Zombo
  • http://unix.stackexchange.com/questions/83385/parse-xml-to-get-node-value-in-bash-script || http://superuser.com/questions/369996/scripting-what-is-the-easiest-to-extact-a-value-in-a-tag-of-a-xml-file – Ciro Santilli OurBigBook.com Oct 07 '15 at 10:57
  • `echo '<html><head><title>Example</title></head></html>' | yq -p xml '.html.head.title'` outputs `Example`. See: [yq](https://mikefarah.gitbook.io/yq/), [some examples](https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash/74391877#74391877) – jpseng Dec 06 '22 at 20:41

17 Answers


This is really just an explanation of Yuzem's answer, but I didn't feel like this much editing should be done to someone else's answer, and comments don't allow formatting, so...

rdom () { local IFS=\> ; read -d \< E C ;}

Let's call that "read_dom" instead of "rdom", space it out a bit and use longer variables:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
}

Okay, so it defines a function called read_dom. The first line makes IFS (the input field separator) local to this function and changes it to >. That means that when you read data, instead of automatically being split on space, tab, or newlines, it gets split on '>'. The next line says to read input from stdin and, instead of stopping at a newline, stop when you see a '<' character (the -d flag sets the delimiter). What is read is then split using the IFS and assigned to the variables ENTITY and CONTENT. So take the following:

<tag>value</tag>

The first call to read_dom gets an empty string (since the '<' is the first character). That gets split by IFS into just '', since there isn't a '>' character. read then assigns an empty string to both variables. The second call gets the string 'tag>value'. That is then split by the IFS into the two fields 'tag' and 'value'. read then assigns the variables like: ENTITY=tag and CONTENT=value. The third call gets the string '/tag>', which splits into '/tag' and '', so ENTITY=/tag and CONTENT is empty; but because end-of-file is reached before another '<', this read returns a non-zero status, which is what terminates a while loop over read_dom (and is why the very last tag in a document never makes it into the loop body; a later answer works around this by swapping the delimiters).
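A quick way to see this in action (a sketch, feeding the fragment via a here-string; the final chunk is consumed but its read returns non-zero, so the loop stops before printing it):

while read_dom; do
    echo "ENTITY='$ENTITY' CONTENT='$CONTENT'"
done <<< '<tag>value</tag>'

which prints:

ENTITY='' CONTENT=''
ENTITY='tag' CONTENT='value'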

Now his while loop cleaned up a bit to match the above:

while read_dom; do
    if [[ $ENTITY = "title" ]]; then
        echo "$CONTENT"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt

The first line just says, "while the read_dom function returns a zero status, do the following." The second line checks if the entity we've just seen is "title". The next line echoes the content of the tag (quoted, to avoid word splitting of the value). The fourth line exits. If it wasn't the title entity then the loop repeats on the sixth line. We redirect "xhtmlfile.xhtml" into standard input (for the read_dom function) and redirect standard output to "titleOfXHTMLPage.txt" (the echo from earlier in the loop).

Now given the following (similar to what you get from listing a bucket on S3) for input.xml:

<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>sth-items</Name>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>item-apple-iso@2x.png</Key>
    <LastModified>2011-07-25T22:23:04.000Z</LastModified>
    <ETag>&quot;0032a28286680abee71aed5d059c6a09&quot;</ETag>
    <Size>1785</Size>
    <StorageClass>STANDARD</StorageClass>
  </Contents>
</ListBucketResult>

and the following loop:

while read_dom; do
    echo "$ENTITY => $CONTENT"
done < input.xml

You should get:

 => 
ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/" => 
Name => sth-items
/Name => 
IsTruncated => false
/IsTruncated => 
Contents => 
Key => item-apple-iso@2x.png
/Key => 
LastModified => 2011-07-25T22:23:04.000Z
/LastModified => 
ETag => &quot;0032a28286680abee71aed5d059c6a09&quot;
/ETag => 
Size => 1785
/Size => 
StorageClass => STANDARD
/StorageClass => 
/Contents => 

So if we wrote a while loop like Yuzem's:

while read_dom; do
    if [[ $ENTITY = "Key" ]] ; then
        echo "$CONTENT"
    fi
done < input.xml

We'd get a listing of all the files in the S3 bucket.
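A commenter asks how to capture a value in a variable instead of printing it; a minimal sketch is to collect the matches in an array (the variable names here are illustrative):

keys=()
while read_dom; do
    [[ $ENTITY = "Key" ]] && keys+=("$CONTENT")
done < input.xml
printf '%s\n' "${keys[@]}"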

EDIT: If for some reason local IFS=\> doesn't work for you and you set it globally, you should reset it at the end of the function, like:

read_dom () {
    ORIGINAL_IFS=$IFS
    IFS=\>
    read -d \< ENTITY CONTENT
    IFS=$ORIGINAL_IFS
}

Otherwise, any line splitting you do later in the script will be messed up.
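As a commenter points out below, an alternative that avoids touching the shell-wide IFS entirely is to scope the assignment to the single read call; this is the same function written that way (a sketch):

read_dom () {
    IFS=\> read -d \< ENTITY CONTENT
}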

EDIT 2: To split out attribute name/value pairs, you can augment read_dom() like so:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local ret=$?
    TAG_NAME=${ENTITY%% *}    # everything before the first space
    ATTRIBUTES=${ENTITY#* }   # everything after the first space
    return $ret
}
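A quick sketch of what the new variables hold for a tag with attributes:

while read_dom; do
    [[ -n $TAG_NAME ]] && echo "TAG_NAME='$TAG_NAME' ATTRIBUTES='$ATTRIBUTES'"
done <<< '<foo size="1789" type="unknown"></foo>'

which prints: TAG_NAME='foo' ATTRIBUTES='size="1789" type="unknown"'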

Then write your function to parse and get the data you want like this:

parse_dom () {
    if [[ $TAG_NAME = "foo" ]] ; then
        eval local $ATTRIBUTES    # turns size="1789" into a local variable (eval executes its input, so only do this on trusted XML)
        echo "foo size is: $size"
    elif [[ $TAG_NAME = "bar" ]] ; then
        eval local $ATTRIBUTES
        echo "bar type is: $type"
    fi
}

Then, while you read_dom, call parse_dom:

while read_dom; do
    parse_dom
done

Then given the following example markup:

<example>
  <bar size="bar_size" type="metal">bars content</bar>
  <foo size="1789" type="unknown">foos content</foo>
</example>

You should get this output:

$ cat example.xml | ./bash_xml.sh 
bar type is: metal
foo size is: 1789

EDIT 3: Another user said they were having problems with it in FreeBSD and suggested saving the exit status from read and returning it at the end of read_dom, like:

read_dom () {
    local IFS=\>
    read -d \< ENTITY CONTENT
    local RET=$?
    TAG_NAME=${ENTITY%% *}
    ATTRIBUTES=${ENTITY#* }
    return $RET
}

I don't see any reason why that shouldn't work.

chad
  • The listing is nice but I don't really know where to go from there. So say I wanted to put "1785", the "size" in a variable. How would I do that? – obesechicken13 Jul 23 '12 at 14:01
  • @obesechicken13 Easy, let's say your variable is named `num`: look at the very last `while` loop in chad's answer. Instead of `echo $CONTENT` put `num=$CONTENT`. –  Jul 25 '12 at 06:00
  • For me, the `read_dom` function only works if I make the IFS global : `IFS='>'`. I had to remove the `local`. –  Jul 25 '12 at 06:09
  • If you make IFS (the input field separator) global you should reset it back to its original value at the end; I edited the answer to have that. Otherwise any other input splitting you do later in your script will be messed up. I suspect the reason local doesn't work for you is that either you are using bash in a compatibility mode (like your shebang is #!/bin/sh) or it's an ancient version of bash. – chad Jul 25 '12 at 15:24
  • @obesechicken13, I added an example of parsing attributes. – chad Jul 25 '12 at 16:05
  • Just because you can write your own parser, doesn't mean you should. – Stephen Niedzielski Apr 23 '13 at 21:49
  • @chad it certainly says something about AWS' workflow/implementation that I was searching for an answer to "bash xml" to also *wget* the contents of an S3 bucket! – Alastair Sep 18 '13 at 12:36
  • @Alastair I have a whole set of S3 manipulation bash scripts, I'll ask my manager if I can release them. – chad Oct 10 '13 at 15:42
  • @Alastair see https://github.com/chad3814/s3scripts for a set of bash scripts that we use to manipulate S3 objects – chad Oct 11 '13 at 14:27
  • Grokkin' contribution, there, @chad ! Checking them out now! – Alastair Oct 16 '13 at 17:36
  • Assigning IFS in a local variable is fragile and not necessary. Just do: `IFS=\> read ...`, which will only set IFS for the read call. (Note that I am in no way endorsing the practice of using `read` to parse xml, and I believe doing so is fraught with peril and ought to be avoided.) – William Pursell Nov 27 '13 at 16:47
  • Downvoted for attempting to roll your own XML parser. This is an extremely bad idea. – Tomalak Apr 30 '15 at 11:06
  • xml has a nested structure & you can have the same 'entity' names; with this approach you lose that nested structure, meaning you can't fetch the information you need, especially when the entity names are the same, e.g. ```VolvoAudio``` It's even worse when you want the list of all the 'cars'. – Melroy van den Berg Jan 04 '16 at 14:02
  • **Parsing XML with grep and awk is not okay**. It may be an acceptable compromise if the XMLs are enough simple and you have not too much time, but it can't be called a good solution ever. – peterh Feb 01 '18 at 16:33
  • Really cool! But what about an example like ``? This will result in `ATTRIBUTES=path=/path/to/my/dev' /`; is there an easy way to remove the `/`? – Alex Belous Jun 27 '22 at 12:02
  • This is an old question, but I'm now very interested to learn more from @StephenNiedzielski Tomalak and peterh about this. Can you elaborate, or provide links on why this is so problematic?? – DryLabRebel Jan 18 '23 at 01:12
  • I apologize that my comment was so rude and negative. My perspective is that an ad hoc XML parser implementation is unlikely to be correct (maybe this one is but there isn't a comprehensive suite of tests that prove it). Picking one of the dedicated tools mentioned in the other answers that solve XML parsing well seems much more practical to me. – Stephen Niedzielski Jan 19 '23 at 05:16
  • @dryLabRebel I (the guy that wrote this) would agree with Stephen Niedzielski. Eleven and a half years later there must certainly be better ways to get XML data into bash. As kinda alluded to in the post, this was used for raw communication with S3, and didn't require anything other than bash and curl. The XML generated by S3 being well known and formed makes this bespoke code possible. This is definitely not a generally usable XML parser. – chad Apr 06 '23 at 15:39
  • Thanks for the feedback. I was imagining that part of the reason would've been potential security risks with xml. – DryLabRebel Apr 12 '23 at 01:50

Command-line tools that can be called from shell scripts include:

  • 4xpath - command-line wrapper around Python's 4Suite package

  • XMLStarlet

  • xpath - command-line wrapper around Perl's XPath library

    sudo apt-get install libxml-xpath-perl
    
  • Xidel - Works with URLs as well as files. Also works with JSON

I also use xmllint and xsltproc with little XSL transform scripts to do XML processing from the command line or in shell scripts.
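For example, the title extraction from the question looks like this with xmllint (a sketch; if the XHTML declares a default namespace, the bare path will not match and you would need a namespace-aware query or xmllint's --html mode):

xmllint --xpath 'string(/html/head/title)' xhtmlfile.xhtml > titleOfXHTMLPage.txt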

Nat

You can do that very easily using only bash. You only have to add this function:

rdom () { local IFS=\> ; read -d \< E C ;}

Now you can use rdom like read, but for HTML/XML documents. When called, rdom will assign the element name to the variable E and the content to the variable C.

For example, to do what you wanted to do:

while rdom; do
    if [[ $E = title ]]; then
        echo "$C"
        exit
    fi
done < xhtmlfile.xhtml > titleOfXHTMLPage.txt
Yuzem
  • could you elaborate on this? i'd bet that it's perfectly clear to you.. and this could be a great answer - if I could tell what you were doing there.. can you break it down a little more, possibly generating some sample output? – Alex Gray Jul 04 '11 at 02:14
  • Cred to the original - this one-liner is so freakin' elegant and amazing. – maverick Dec 05 '13 at 22:06
  • great hack, but i had to use double quotes like echo "$C" to prevent shell expansion and correct interpretation of end lines (depends on the encoding) – user311174 Jan 16 '14 at 10:32
  • **Parsing XML with grep and awk is not okay**. It may be an acceptable compromise if the XMLs are enough simple and you have not too much time, but it can't be called a good solution ever. – peterh Feb 01 '18 at 16:34

You can use the xpath utility. It's installed with the Perl XML-XPath package.

Usage:

/usr/bin/xpath [filename] query

or XMLStarlet. To install it on openSUSE, use:

sudo zypper install xmlstarlet

or try cnf xml on other platforms.
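As a sketch, the title extraction from the question would look like this with XMLStarlet (assuming no default namespace; for namespaced XHTML you would add -N x=http://www.w3.org/1999/xhtml and query /x:html/x:head/x:title):

xmlstarlet sel -t -v '/html/head/title' xhtmlfile.xhtml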

Grisha
  • Using xml starlet is definitely a better option than writing one's own serializer (as suggested in the other answers). – Bruno von Paris Feb 08 '13 at 15:26
  • On many systems, the `xpath` which comes preinstalled is unsuitable for use as a component in scripts. See e.g. http://stackoverflow.com/questions/15461737/how-to-execute-xpath-one-liners-from-shell for an elaboration. – tripleee Jul 27 '16 at 08:47
  • On Ubuntu/Debian `apt-get install xmlstarlet` – rubo77 Dec 24 '16 at 00:48

This is sufficient...

xpath xhtmlfile.xhtml '/html/head/title/text()' > titleOfXHTMLPage.txt
teknopaul

Check out XML2 from http://www.ofb.net/~egnor/xml2/ which converts XML to a line-oriented format.
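That line-oriented output is what makes it shell-friendly: each node becomes a /path=value line you can grep or cut. A sketch for the question's title (from memory of the tool's output format):

xml2 < xhtmlfile.xhtml | grep '^/html/head/title='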

simon04
  • Very useful tool. The link is broken, (see https://web.archive.org/web/20160312110413/https://dan.egnor.name/xml2/ ) but there is a working, frozen clone on github: https://github.com/clone/xml2 – Joshua Goldberg Nov 06 '20 at 19:45

Another command-line tool is my new Xidel. It also supports XPath 2 and XQuery, unlike the already mentioned xpath/xmlstarlet.

The title can be read like:

xidel xhtmlfile.xhtml -e /html/head/title > titleOfXHTMLPage.txt

And it also has a cool feature to export multiple variables to bash. For example

eval $(xidel xhtmlfile.xhtml -e 'title := //title, imgcount := count(//img)' --output-format bash )

sets $title to the title and $imgcount to the number of images in the file, which should be as flexible as parsing it directly in bash.

BeniBela

Starting from chad's answer, here is the COMPLETE working solution to parse XML, with proper handling of comments, with just two little functions (more than two, really, but you can mix them all together). I don't say chad's version didn't work at all, but it had too many issues with badly formatted XML files: you have to be a bit more tricky to handle comments and misplaced spaces/CR/TAB/etc.

The purpose of this answer is to give ready-to-use, out-of-the-box bash functions to anyone needing to parse XML without complex tools in Perl, Python or anything else. As for me, I cannot install cpan nor Perl modules on the old production OS I'm working on, and Python isn't available.

First, a definition of the XML words used in this post:

<!-- comment... -->
<tag attribute="value">content...</tag>

EDIT: updated functions, now handling:

  • Websphere xml (xmi and xmlns attributes)
  • must have a compatible terminal with 256 colors
  • 24 shades of grey
  • compatibility added for IBM AIX bash 3.2.16(1)

The functions; first comes xml_read_dom, which is called repeatedly by xml_read:

xml_read_dom() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
local ENTITY IFS=\>
if $ITSACOMMENT; then
  read -d \< COMMENTS
  COMMENTS="$(rtrim "${COMMENTS}")"
  return 0
else
  read -d \< ENTITY CONTENT
  CR=$?
  [ "x${ENTITY:0:1}x" == "x/x" ] && return 0
  TAG_NAME=${ENTITY%%[[:space:]]*}
  [ "x${TAG_NAME}x" == "x?xmlx" ] && TAG_NAME=xml
  TAG_NAME=${TAG_NAME%%:*}
  ATTRIBUTES=${ENTITY#*[[:space:]]}
  ATTRIBUTES="${ATTRIBUTES//xmi:/}"
  ATTRIBUTES="${ATTRIBUTES//xmlns:/}"
fi

# when comments sticks to !-- :
[ "x${TAG_NAME:0:3}x" == "x!--x" ] && COMMENTS="${TAG_NAME:3} ${ATTRIBUTES}" && ITSACOMMENT=true && return 0

# http://tldp.org/LDP/abs/html/string-manipulation.html
# INFO: oh wait it doesn't work on IBM AIX bash 3.2.16(1):
# [ "x${ATTRIBUTES:(-1):1}x" == "x/x" -o "x${ATTRIBUTES:(-1):1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:(-1)}"
[ "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x/x" -o "x${ATTRIBUTES:${#ATTRIBUTES} -1:1}x" == "x?x" ] && ATTRIBUTES="${ATTRIBUTES:0:${#ATTRIBUTES} -1}"
return $CR
}

and the second one :

xml_read() {
# https://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash
ITSACOMMENT=false
local MULTIPLE_ATTR LIGHT FORCE_PRINT XAPPLY XCOMMAND XATTRIBUTE GETCONTENT fileXml tag attributes attribute tag2print TAGPRINTED attribute2print XAPPLIED_COLOR PROSTPROCESS USAGE
local TMP LOG LOGG
LIGHT=false
FORCE_PRINT=false
XAPPLY=false
MULTIPLE_ATTR=false
XAPPLIED_COLOR=g
TAGPRINTED=false
GETCONTENT=false
PROSTPROCESS=cat
Debug=${Debug:-false}
TMP=/tmp/xml_read.$RANDOM
USAGE="${C}${FUNCNAME}${c} [-cdlp] [-x command <-a attribute>] <file.xml> [tag | \"any\"] [attributes .. | \"content\"]
${nn[2]}  -c = NOCOLOR${END}
${nn[2]}  -d = Debug${END}
${nn[2]}  -l = LIGHT (no \"attribute=\" printed)${END}
${nn[2]}  -p = FORCE PRINT (when no attributes given)${END}
${nn[2]}  -x = apply a command on an attribute and print the result instead of the former value, in green color${END}
${nn[1]}  (no attribute given will load their values into your shell; use '-p' to print them as well)${END}"

! (($#)) && echo2 "$USAGE" && return 99
(( $# < 2 )) && ERROR nbaram 2 0 && return 99
# getopts:
while getopts :cdlpx:a: _OPT 2>/dev/null
do
{
  case ${_OPT} in
    c) PROSTPROCESS="${DECOLORIZE}" ;;
    d) local Debug=true ;;
    l) LIGHT=true; XAPPLIED_COLOR=END ;;
    p) FORCE_PRINT=true ;;
    x) XAPPLY=true; XCOMMAND="${OPTARG}" ;;
    a) XATTRIBUTE="${OPTARG}" ;;
    *) _NOARGS="${_NOARGS}${_NOARGS+, }-${OPTARG}" ;;
  esac
}
done
shift $((OPTIND - 1))
unset _OPT OPTARG OPTIND
[ "X${_NOARGS}" != "X" ] && ERROR param "${_NOARGS}" 0

fileXml=$1
tag=$2
(( $# > 2 )) && shift 2 && attributes=$*
(( $# > 1 )) && MULTIPLE_ATTR=true

[ -d "${fileXml}" -o ! -s "${fileXml}" ] && ERROR empty "${fileXml}" 0 && return 1
$XAPPLY && $MULTIPLE_ATTR && [ -z "${XATTRIBUTE}" ] && ERROR param "-x command " 0 && return 2
# nb attributes == 1 because $MULTIPLE_ATTR is false
[ "${attributes}" == "content" ] && GETCONTENT=true

while xml_read_dom; do
  # (( CR != 0 )) && break
  (( PIPESTATUS[1] != 0 )) && break

  if $ITSACOMMENT; then
    # oh wait it doesn't work on IBM AIX bash 3.2.16(1):
    # if [ "x${COMMENTS:(-2):2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:(-2)}" && ITSACOMMENT=false
    # elif [ "x${COMMENTS:(-3):3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:(-3)}" && ITSACOMMENT=false
    if [ "x${COMMENTS:${#COMMENTS} - 2:2}x" == "x--x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 2}" && ITSACOMMENT=false
    elif [ "x${COMMENTS:${#COMMENTS} - 3:3}x" == "x-->x" ]; then COMMENTS="${COMMENTS:0:${#COMMENTS} - 3}" && ITSACOMMENT=false
    fi
    $Debug && echo2 "${N}${COMMENTS}${END}"
  elif test "${TAG_NAME}"; then
    if [ "x${TAG_NAME}x" == "x${tag}x" -o "x${tag}x" == "xanyx" ]; then
      if $GETCONTENT; then
        CONTENT="$(trim "${CONTENT}")"
        test ${CONTENT} && echo "${CONTENT}"
      else
        # eval local $ATTRIBUTES => eval test "\"\$${attribute}\"" will be true for matching attributes
        eval local $ATTRIBUTES
        $Debug && (echo2 "${m}${TAG_NAME}: ${M}$ATTRIBUTES${END}"; test ${CONTENT} && echo2 "${m}CONTENT=${M}$CONTENT${END}")
        if test "${attributes}"; then
          if $MULTIPLE_ATTR; then
            # we don't print "tag: attr=x ..." for a tag passed as argument: it's useful only for "any" tags, so then we print the matching tags found
            ! $LIGHT && [ "x${tag}x" == "xanyx" ] && tag2print="${g6}${TAG_NAME}: "
            for attribute in ${attributes}; do
              ! $LIGHT && attribute2print="${g10}${attribute}${g6}=${g14}"
              if eval test "\"\$${attribute}\""; then
                test "${tag2print}" && ${print} "${tag2print}"
                TAGPRINTED=true; unset tag2print
                if [ "$XAPPLY" == "true" -a "${attribute}" == "${XATTRIBUTE}" ]; then
                  eval ${print} "%s%s\ " "\${attribute2print}" "\${${XAPPLIED_COLOR}}\"\$(\$XCOMMAND \$${attribute})\"\${END}" && eval unset ${attribute}
                else
                  eval ${print} "%s%s\ " "\${attribute2print}" "\"\$${attribute}\"" && eval unset ${attribute}
                fi
              fi
            done
            # this trick prints a CR only if attributes have been printed during the loop:
            $TAGPRINTED && ${print} "\n" && TAGPRINTED=false
          else
            if eval test "\"\$${attributes}\""; then
              if $XAPPLY; then
                eval echo "\${g}\$(\$XCOMMAND \$${attributes})" && eval unset ${attributes}
              else
                eval echo "\$${attributes}" && eval unset ${attributes}
              fi
            fi
          fi
        else
          echo eval $ATTRIBUTES >>$TMP
        fi
      fi
    fi
  fi
  unset CR TAG_NAME ATTRIBUTES CONTENT COMMENTS
done < "${fileXml}" | ${PROSTPROCESS}
# http://mywiki.wooledge.org/BashFAQ/024
# INFO: I set variables in a "while loop" that's in a pipeline. Why do they disappear? workaround:
if [ -s "$TMP" ]; then
  $FORCE_PRINT && ! $LIGHT && cat $TMP
  # $FORCE_PRINT && $LIGHT && perl -pe 's/[[:space:]].*?=/ /g' $TMP
  $FORCE_PRINT && $LIGHT && sed -r 's/[^\"]*([\"][^\"]*[\"][,]?)[^\"]*/\1 /g' $TMP
  . $TMP
  rm -f $TMP
fi
unset ITSACOMMENT
}

and lastly, the rtrim, trim and echo2 (to stderr) functions:

rtrim() {
local var=$@
var="${var%"${var##*[![:space:]]}"}"   # remove trailing whitespace characters
echo -n "$var"
}
trim() {
local var=$@
var="${var#"${var%%[![:space:]]*}"}"   # remove leading whitespace characters
var="${var%"${var##*[![:space:]]}"}"   # remove trailing whitespace characters
echo -n "$var"
}
echo2() { echo -e "$@" 1>&2; }

Colorization:

oh and you will need some neat colorizing dynamic variables to be defined at first, and exported, too:

set -a
TERM=xterm-256color
case ${UNAME} in
AIX|SunOS)
  M=$(${print} '\033[1;35m')
  m=$(${print} '\033[0;35m')
  END=$(${print} '\033[0m')
;;
*)
  m=$(tput setaf 5)
  M=$(tput setaf 13)
  # END=$(tput sgr0)          # issue on Linux: it can produces ^[(B instead of ^[[0m, more likely when using screenrc
  END=$(${print} '\033[0m')
;;
esac
# 24 shades of grey:
for i in $(seq 0 23); do eval g$i="$(${print} \"\\033\[38\;5\;$((232 + i))m\")" ; done
# another way of having an array of 5 shades of grey:
declare -a colorNums=(238 240 243 248 254)
for num in 0 1 2 3 4; do nn[$num]=$(${print} "\033[38;5;${colorNums[$num]}m"); NN[$num]=$(${print} "\033[48;5;${colorNums[$num]}m"); done
# piped decolorization:
DECOLORIZE='eval sed "s,${END}\[[0-9;]*[m|K],,g"'

How to load all that stuff:

Either you know how to create functions and load them via FPATH (ksh) or an emulation of FPATH (bash)

If not, just copy/paste everything on the command line.

How it works:

xml_read [-cdlp] [-x command <-a attribute>] <file.xml> [tag | "any"] [attributes .. | "content"]
  -c = NOCOLOR
  -d = Debug
  -l = LIGHT (no \"attribute=\" printed)
  -p = FORCE PRINT (when no attributes given)
  -x = apply a command on an attribute and print the result instead of the former value, in green color
  (no attribute given will load their values into your shell as $ATTRIBUTE=value; use '-p' to print them as well)

xml_read server.xml title content     # print content between <title></title>
xml_read server.xml Connector port    # print all port values from Connector tags
xml_read server.xml any port          # print all port values from any tags

With Debug mode (-d), comments and parsed attributes are printed to stderr.

scavenger
  • I'm trying to use the above two functions which produces the following: `./read_xml.sh: line 22: (-1): substring expression < 0`? – khmarbaise Mar 05 '14 at 08:37
  • Line 22: `[ "x${ATTRIBUTES:(-1):1}x" == "x?x" ] ...` – khmarbaise Mar 05 '14 at 08:47
  • sorry khmarbaise, these are bash shell functions. If you want to adapt them as shell scripts, you certainly have to expect some minor adaptations! Also the updated functions handle your errors ;) – scavenger Apr 08 '15 at 03:36
  • Be nice if this was in a shell script for people like me who need it to run from a service. – JPM Aug 25 '23 at 19:21

yq can be used for XML parsing (required version for the examples below: >= 4.30.5).

It is a lightweight and portable command-line YAML processor and can also deal with XML. The syntax is similar to jq.

Input

<root>
  <myel name="Foo" />
  <myel name="Bar">
    <mysubel>stairway to heaven</mysubel>
  </myel>
</root>

Usage example 1

yq --input-format xml '.root.myel.0.+@name' $FILE

Foo

Usage example 2

yq has a nice builtin feature to make XML easily grep-able

yq --input-format xml --output-format props $FILE

root.myel.0.+@name = Foo
root.myel.1.+@name = Bar
root.myel.1.mysubel = stairway to heaven

Usage example 3

yq can also convert an XML input into JSON or YAML

yq --input-format xml --output-format json $FILE

{
  "root": {
    "myel": [
      {
        "+@name": "Foo"
      },
      {
        "+@name": "Bar",
        "mysubel": "stairway to heaven"
      }
    ]
  }
}

yq --input-format xml $FILE (YAML is the default format)

root:
  myel:
    - +@name: Foo
    - +@name: Bar
      mysubel: stairway to heaven
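Applied to the question, the title extraction becomes (mirroring an example from the question's comments):

yq --input-format xml '.html.head.title' xhtmlfile.xhtml > titleOfXHTMLPage.txt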
jpseng

I am not aware of any pure-shell XML parsing tool. So you will most likely need a tool written in another language.

My XML::Twig Perl module comes with such a tool: xml_grep, where you would probably write what you want as xml_grep -t '/html/head/title' xhtmlfile.xhtml > titleOfXHTMLPage.txt (the -t option gives you the result as text instead of XML).

mirod

Well, you can use the xpath utility. I guess Perl's XML::XPath contains it.

alamar

After some research into translating between Linux and Windows formats of file paths in XML files, I found interesting tutorials and solutions on:

user485380

While there are quite a few ready-made console utilities that might do what you want, it will probably take less time to write a couple of lines of code in a general-purpose programming language such as Python which you can easily extend and adapt to your needs.

Here is a python script which uses lxml for parsing — it takes the name of a file or a URL as the first parameter, an XPath expression as the second parameter, and prints the strings/nodes matching the given expression.

Example 1

#!/usr/bin/env python
import sys
from lxml import etree

tree = etree.parse(sys.argv[1])
xpath_expression = sys.argv[2]

#  a hack allowing to access the
#  default namespace (if defined) via the 'p:' prefix    
#  E.g. given a default namespaces such as 'xmlns="http://maven.apache.org/POM/4.0.0"'
#  an XPath of '//p:module' will return all the 'module' nodes
ns = tree.getroot().nsmap
if ns.keys() and None in ns:
    ns['p'] = ns.pop(None)
#   end of hack    

for e in tree.xpath(xpath_expression, namespaces=ns):
    if isinstance(e, str):
        print(e)
    else:
        print(e.text and e.text.strip() or etree.tostring(e, pretty_print=True))

lxml can be installed with pip install lxml. On Ubuntu you can use sudo apt install python-lxml.

Usage

python xpath.py myfile.xml "//mynode"

lxml also accepts a URL as input:

python xpath.py http://www.feedforall.com/sample.xml "//link"

Note: If your XML has a default namespace with no prefix (e.g. xmlns=http://abc...) then you have to use the p prefix (provided by the 'hack') in your expressions, e.g. //p:module to get the modules from a pom.xml file. In case the p prefix is already mapped in your XML, then you'll need to modify the script to use another prefix.


Example 2

A one-off script which serves the narrow purpose of extracting module names from an apache maven file. Note how the node name (module) is prefixed with the default namespace {http://maven.apache.org/POM/4.0.0}:

pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modules>
        <module>cherries</module>
        <module>bananas</module>
        <module>pears</module>
    </modules>
</project>

module_extractor.py:

from lxml import etree
for _, e in etree.iterparse(open("pom.xml"), tag="{http://maven.apache.org/POM/4.0.0}module"):
    print(e.text)
ccpizza
  • This is awesome when you either want to avoid installing extra packages or don't have access to. On a build machine, I can justify an extra `pip install` over `apt-get` or `yum` call. Thanks! – E. Moffat Oct 31 '18 at 00:12

Yuzem's method can be improved by inverting the order of the < and > signs in the rdom function and the variable assignments, so that:

rdom () { local IFS=\> ; read -d \< E C ;}

becomes:

rdom () { local IFS=\< ; read -d \> C E ;}

If the parsing is not done like this, the last tag in the XML file is never reached. This can be problematic if you intend to output another XML file at the end of the while loop.
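A quick sketch of the difference: with the inverted delimiters, every read ends on a '>', so even the final closing tag is processed before end-of-file (note that the content now arrives together with the closing tag rather than the opening one):

while rdom; do echo "E='$E' C='$C'"; done <<< '<tag>value</tag>'

E='tag' C=''
E='/tag' C='value'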

michaelmeyer

While it seems like "never parse XML, JSON... from bash without a proper tool" is sound advice, I disagree. If this is a side job, it is wasteful to look for the proper tool, then learn it... Awk can do it in minutes. My programs have to work on all of the above-mentioned kinds of data and more. Hell, I do not want to test 30 tools to parse the 5-7-10 different formats I need if I can awk the problem in minutes. I do not care about XML, JSON or whatever! I need a single solution for all of them.

As an example: my SmartHome program runs our homes. While doing it, it reads a plethora of data in too many different formats I can not control. I never use dedicated, proper tools since I do not want to spend more than minutes on reading the data I need. With FS and RS adjustments, this awk solution works perfectly for any textual format. But it may not be the proper answer when your primary task is to work primarily with loads of data in that format!

I faced the problem of parsing XML from bash yesterday. Here is how I do it for any hierarchical data format. As a bonus, I assign the data directly to variables in a bash script.

To make things easier to read, I will present the solution in stages. From the OP's test data, I created a file: test.xml
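The file itself is not included in the answer; a purely hypothetical shape the commands below can be read against would be:

# hypothetical test.xml, for illustration only -- not from the original answer
cat > test.xml <<'EOF'
<config>
<host>localhost</host>
<username>admin</username>
<password>secret</password>
<dbname>mydb</dbname>
</config>
EOF

With lines shaped like <host>localhost</host>, the split on < and > puts the tag name in $2 and the value in $3 (a commenter below also reports the value in $3), so adjust the field numbers in the commands to the exact shape of your own data.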

Parsing said XML in bash and extracting the data in 90 chars:

awk 'BEGIN { FS="<|>"; RS="\n" }; /host|username|password|dbname/ { print $2, $4 }' test.xml

I normally use the more readable version, since it is easier to modify in real life as I often need to test differently:

awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2,$4}' test.xml

I do not care what the format is called. I seek only the simplest solution. In this particular case, I can see from the data that newline is the record separator (RS) and that < and > delimit fields (FS). In my original case, I had complicated indexing of 6 values within two records, relating them, finding when the data exists, plus fields (records) that may or may not exist. It took 4 lines of awk to solve the problem perfectly. So, adapt the idea to each need before using it!

The second part simply looks for a wanted string in a line (RS) and, if found, prints the needed fields (FS). The above took me about 30 seconds to copy and adapt from the last command I used this way (4 times longer). And that is it! Done in 90 chars.

But, I always need to get the data neatly into variables in my script. I first test the constructs like so:

awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2"=\""$4"\"" }' test.xml

In some cases I use printf instead of print. When everything looks good, I simply finish by assigning the values to variables. I know many think "eval" is "evil", no need to comment :) The trick has worked perfectly on all four of my networks for years. But keep learning if you do not understand why this may be bad practice! Including the bash variable assignments and ample spacing, my solution needs 120 chars to do everything.

eval $( awk 'BEGIN { FS="<|>"; RS="\n" }; { if ($0 ~ /host|username|password|dbname/) print $2"=\""$4"\"" }' test.xml ); echo "host: $host, username: $username, password: $password dbname: $dbname"
Pila
  • There are serious security problems with this approach. You don't want a password containing `$(rm -rf ~)` to `eval` that command (and if you changed your injected quotes from double to single, they could then be defeated with `$(rm -rf ~)'$(rm -rf ~)'`). – Charles Duffy Nov 10 '20 at 20:07
  • ...so, if you want to make this safe, you need to *both* (1) switch from injecting double quotes to single quotes; and (2) replace any literal single quotes in the data with a construct like `'"'"'` – Charles Duffy Nov 10 '20 at 20:09
  • Also, `eval "$(...)"`, not just `eval $(...)`. For an example of how the latter leads to buggy results, try `cmd=$'printf \'%s\\n\' \'first * line\''`, and then compare the output of `eval $cmd` to the output of `eval "$cmd"` -- without the quotes, your `*` gets replaced with a list of files in the current directory before `eval` starts its parsing (meaning _those filenames themselves_ get evaluated as code, opening even more potential room for security issues). – Charles Duffy Nov 10 '20 at 20:11
  • Never parse XML or JSON without a proper tool is sound advice. The only exception would be if you need to stream the input because of its size. – Ihe Onwuka Oct 06 '21 at 12:45
  • Great solution. For me the value was contained in `$3` variable (MacOS, z-shell terminal). – ElectroBuddha Dec 20 '22 at 12:32

This works if you want XML attributes: the sed command strips the leading tag name and the trailing />, leaving name="value" pairs that the shell can source as variable assignments:

$ cat alfa.xml
<video server="asdf.com" stream="H264_400.mp4" cdn="limelight"/>

$ sed 's.[^ ]*..;s./>..' alfa.xml > alfa.sh

$ . ./alfa.sh

$ echo "$stream"
H264_400.mp4
Zombo

Try xpe. It was built specifically for this purpose. You can install it with pip for Python 3:

pip3 install xpe

You can use it like so:

curl example.com | xpe '//title'

The above command returns:

Example Domain

pancake