71

I would like to know the best way to parse an XML file using shell script.

  • Should one do it by hand?
  • Do third-party libraries exist?

If you have already done it, could you let me know how you managed to do it?

Zombo
Spredzy

10 Answers

100

You could try xmllint

The xmllint program parses one or more XML files, specified on the command line as xmlfile. It prints various types of output, depending upon the options selected. It is useful for detecting errors both in XML code and in the XML parser itself.

It allows you to select elements in the XML doc by XPath, using the --xpath option.

On Mac OS X (Yosemite), it is installed by default.
On Ubuntu, if it is not already installed, you can run `apt-get install libxml2-utils`.
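A quick sketch of `--xpath` usage (the file name and document here are invented for illustration):

```shell
# Create a throwaway sample document (hypothetical path)
cat > /tmp/books.xml <<'EOF'
<catalog>
  <book><title>Dune</title></book>
  <book><title>Hyperion</title></book>
</catalog>
EOF

# --xpath evaluates an XPath expression against the document
# and prints the matching text node(s)
xmllint --xpath '//book[1]/title/text()' /tmp/books.xml
```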

Jason
Joel
  • 1
    this, for example, would extract the text value from the body of the sample_type tag (used just now for TCGA result sets): `xmllint --xpath 'sample_type/text()' result.xml` – Erik Aronesty May 02 '13 at 14:03
  • 1
    in Ubuntu 10.04.4 LTS it is not installed by default (at least it wasn't on my machine). Needed to run `apt-get install libxml2-utils` to use it. – Harry Aug 21 '13 at 12:30
24

Here's a fully working example.

If it's only extracting email addresses you could do something like:

  1. Suppose XML file spam.xml is like

    <spam>
    <victims>
      <victim>
        <name>The Pope</name>
        <email>pope@vatican.gob.va</email>
        <is_satan>0</is_satan>
      </victim>
      <victim>
        <name>George Bush</name>
        <email>father@nwo.com</email>
        <is_satan>1</is_satan>
      </victim>
      <victim>
        <name>George Bush Jr</name>
        <email>son@nwo.com</email>
        <is_satan>0</is_satan>
      </victim>
    </victims>
    </spam>
    
  2. You can get the emails and process them with this short bash code:

    #!/bin/bash
    emails=($(grep -oP '(?<=email>)[^<]+' "/my_path/spam.xml"))
    
    for i in ${!emails[*]}
    do
      echo "$i" "${emails[$i]}"
      # instead of echo use the values to send emails, etc
    done
    

The result of this example is:

0 pope@vatican.gob.va
1 father@nwo.com
2 son@nwo.com

Important note:
Don't use this for serious matters. This is OK for playing around, getting quick results, learning to grep, etc. but you should definitely look for, learn and use an XML parser for production (see Micha's comment below).

seenukarthi
aesede
  • That was exactly what I was looking for. It is working as supposed to, but I fail to understand the -o and -P arguments and the expression you are using for the grep. Can you explain it? Just trying to learn something new. – thexpand Jul 13 '14 at 20:35
  • 3
    Hi, the `-o` or `--only-matching` means "only show the matching part", in this case the emails. The `-P` or `--perl-regexp` means "use a regular expression as if it was Perl". You can see this and all other options just by doing `grep --help` in the command line. Also you can do `man grep` for the full manual. – aesede Jul 14 '14 at 20:23
  • 2
    Also worth noticing that this is a quick and dirty way of getting emails from an XML document. You could get same result with this in the command line: `for email in $(cat /my_path/spam.xml | grep -oP '(?<=email>)[^<]+'); do echo "$email"; done` If you aim to use it for production you should **definetly** use an XML parser. In my case I use Python scripts with [lxml](http://lxml.de/) – aesede Oct 16 '15 at 18:50
  • 2
    Why are you using grep? You must not use a regex to parse xml: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Micha Wiedenmann Mar 09 '16 at 07:56
  • 2
    Yeah @MichaWiedenmann that post is a classic, a must read for anybody! Notice that I don't recommend my solution for production, just for quick and dirty things from the command line. You should always use an XML parser for real life stuff. – aesede Mar 09 '16 at 13:42
  • (I know I'm wrong, but I feel using regex instead of a parser is dirtier, and maybe I like unimportant things to be dirty, it helps thinking outside of the box) – aesede Mar 09 '16 at 13:54
  • Here is an enhanced update to @aesede answer for Google mail filter elements: `#!/bin/bash` `element_start=""` `file_addres="./Google_mail_filter.xml"` `elemental_list=($(egrep -o "${element_start}.*.${element_end}$" $file_address |sed -e "s|${element_start}||g; s|${element_end}||g"))` `echo ${elemental_list[*]} ` – PaSe May 26 '17 at 17:14
  • Does not work keeps saying file not found and -P option is not there for grep on a Mac – JPM Aug 28 '23 at 19:58
13

There's also xmlstarlet (which is available for Windows as well).

http://xmlstar.sourceforge.net/doc/xmlstarlet.txt

tim
11

I am surprised no one has mentioned xmlsh. The mission statement:

A command line shell for XML Based on the philosophy and design of the Unix Shells

xmlsh provides a familiar scripting environment, but specifically tailored for scripting xml processes.

A list of shell like commands are provided here.

I use the xed command a lot; it is the equivalent of sed for XML, and allows XPath-based search and replace.

mtk
user49310
10

Try sgrep. It's not clear exactly what you are trying to do, but I surely would not attempt writing an XML parser in bash.

Keith
  • 3
    Hear hear, I wrote a "parser" (I wouldn't really call it parser, although it worked quite well) for JSON using sed/awk, it was a nightmare. – Anders Jan 13 '11 at 12:50
7

Do you have xml_grep installed? It's a Perl-based utility, standard on some distributions (it came pre-installed on my CentOS system). Rather than giving it a regular expression, you give it an XPath expression.

frankc
  • Yes, I had more success with xml_grep. Here is an example to get the connect string from a JBoss XML: `xml_grep '/domain/profiles/profile[@name="server1"]//datasources//connection-url' domain.xml`. See the XPath syntax. – phil_w Nov 11 '15 at 17:52
5

A rather new project is the xml-coreutils package featuring xml-cat, xml-cp, xml-cut, xml-grep, ...

http://xml-coreutils.sourceforge.net/contents.html

user321
4

Try using xpath. You can use it to parse elements out of an XML tree.

http://www.ibm.com/developerworks/xml/library/x-tipclp/index.html

Mark Rose
3

This really is beyond the capabilities of shell script. Shell script and the standard Unix tools are okay at parsing line-oriented files, but things change when you talk about XML. Even simple tags can present a problem:

<MYTAG>Data</MYTAG>

<MYTAG>
     Data
</MYTAG>

<MYTAG param="value">Data</MYTAG>

<MYTAG><ANOTHER_TAG>Data
</ANOTHER_TAG></MYTAG>

Imagine trying to write a shell script that can read the data enclosed in <MYTAG>. The four very, very simple XML examples all show different ways this can be an issue. The first two are exactly the same syntax in XML. The third simply has an attribute attached to it. The fourth contains the data in another tag. Simple sed, awk, and grep commands cannot catch all the possibilities.

You need to use a full-blown scripting language like Perl, Python, or Ruby. Each of these has modules that can parse XML data and make the underlying structure easier to access. I've used XML::Simple in Perl. It took me a few tries to understand it, but it did what I needed and made my programming much easier.
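As a sketch of that approach: the answer names XML::Simple for Perl; Python's standard-library xml.etree.ElementTree is shown here instead because it needs no extra install (the file path is invented for illustration):

```shell
# One of the "hard" cases from above: attribute plus nested tag
cat > /tmp/sample.xml <<'EOF'
<MYTAG param="value">
  <ANOTHER_TAG>Data</ANOTHER_TAG>
</MYTAG>
EOF

# A real parser handles attributes, whitespace and nesting uniformly
python3 - <<'EOF'
import xml.etree.ElementTree as ET
root = ET.parse('/tmp/sample.xml').getroot()
print(root.tag, root.get('param'), root.find('ANOTHER_TAG').text)
EOF
```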

Pang
David W.
  • 1
    i think what he meant was... is there a standard command line parser like xmllint...to extract a value from an xml file – Erik Aronesty May 02 '13 at 13:59
1

Here's a solution using xml_grep (because xpath wasn't part of our distributable and I didn't want to add it to all production machines)...

If you are looking for a specific setting in an XML file, and if all elements at a given tree level are unique, and there are no attributes, then you can use this handy function:

# File to be parsed
xmlFile="xxxxxxx"

# use xml_grep to find settings in an XML file
# Input ($1): path to setting
function getXmlSetting() {

    # Filter out the element name for parsing
    local element=`echo $1 | sed 's/^.*\///'`

    # Verify the element is not empty
    local check=${element:?getXmlSetting invalid input: $1}

    # Parse out the CDATA from the XML element
    # 1) Find the element (xml_grep)
    # 2) Remove newlines (tr -d \n)
    # 3) Extract CDATA by looking for *element> CDATA <element*
    # 4) Remove leading and trailing spaces
    local getXmlSettingResult=`xml_grep --cond $1 $xmlFile 2>/dev/null | tr -d '\n' | sed -n -e "s/.*$element>[[:space:]]*\([^[:space:]].*[^[:space:]]\)[[:space:]]*<\/$element.*/\1/p"`

    # Return the result
    echo $getXmlSettingResult
}

#EXAMPLE
logPath=`getXmlSetting //config/logs/path`
check=${logPath:?"XML file missing //config/logs/path"}

This will work with this structure:

<config>
  <logs>
     <path>/path/to/logs</path>
  </logs>
</config>

It will also work with this (but it won't keep the newlines):

<config>
  <logs>
     <path>
          /path/to/logs
     </path>
  </logs>
</config>

If you have duplicate <config> or <logs> or <path>, then it will only return the last one. You can probably modify the function to return an array if it finds multiple matches.

FYI: This code works on RedHat 6.3 with GNU BASH 4.1.2, but I don't think I'm doing anything particular to that, so should work everywhere.

NOTE: For anybody new to scripting, make sure you use the right types of quotes; all three are used in this code (single quote ' = literal, backtick ` = execute, and double quote " = group).

Ed K