47

I have an XML file with the following contents:

<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>

I need a way to extract what is inside the <job ...> </job> tags, programming in this case. This should be done at the Linux command prompt, using grep/sed/awk.

amarillion
Zacky112
  • If your XML file contained this: Tom &amp; Jerry would you want the result to have the XML escaping left alone: Tom &amp; Jerry, or would you want the escaping to be undone, as an XML parser would: Tom & Jerry? If it's the latter, sorry, I don't know how to do that with Unix text tools. – Paul Clapham Feb 09 '10 at 03:04
  • @Paul `s/&amp;/\&/g`, same for `&quot;` etc. Of course it won't generalize to user-defined entities etc. – 13ren Feb 10 '10 at 11:54
  • [https://stackoverflow.com/a/17333829/3291390](https://stackoverflow.com/a/17333829/3291390) – Stack Underflow Jan 25 '20 at 03:41

11 Answers

68

Do you really have to use only those tools? They're not designed for XML processing, and although it's possible to get something that works OK most of the time, it will fail on edge cases, like encoding, line breaks, etc.

I recommend xml_grep:

xml_grep 'job' jobs.xml --text_only

Which gives the output:

programming

On Ubuntu/Debian, xml_grep is in the xml-twig-tools package.
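
If installing xml_grep isn't an option, a parser from a scripting language's standard library can do the same job. Here is a minimal sketch in Python, assuming the question's file content; the function name `job_texts` is made up for illustration and is not part of any tool mentioned here:

```python
import xml.etree.ElementTree as ET

# The question's document, inlined as bytes so the encoding declaration is honoured.
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
"""


def job_texts(xml_bytes, namespace="http://www.sample.com/"):
    """Return the text of every <job> element in the given namespace."""
    root = ET.fromstring(xml_bytes)
    # ElementTree spells namespaced tags as "{namespace-uri}localname".
    tag = "{%s}job" % namespace
    # iter() visits the root itself as well as all of its descendants.
    return [el.text for el in root.iter(tag)]


print(job_texts(XML))  # prints ['programming']
```

The namespace has to be spelled out because the xmlns attribute puts job into that namespace; searching for a bare "job" would find nothing.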

amarillion
16
 grep '<job' file_name | cut -f2 -d">"|cut -f1 -d"<"
Vijay
11

Using xmlstarlet:

echo '<job xmlns="http://www.sample.com/">programming</job>' | \
   xmlstarlet sel -N var="http://www.sample.com/" -t -m "//var:job" -v '.'
lmxy
    There is a significant number of different tools which use standard XPath notation to extract information from XML -- `xmlstarlet` is just one. Others include `xmllint`, `xpath`, etc. See http://stackoverflow.com/questions/15461737/how-to-execute-xpath-one-liners-from-shell – tripleee Jun 10 '15 at 07:28
9

Please don't use line and regex based parsing on XML. It is a bad idea. You can have semantically identical XML with different formatting, and regex and line based parsing simply cannot cope with it.

Things like unary tags and variable line wrapping - these snippets 'say' the same thing:

<root>
  <sometag val1="fish" val2="carrot" val3="narf"></sometag>
</root>


<root>
  <sometag
      val1="fish"
      val2="carrot"
      val3="narf"></sometag>
</root>

<root
><sometag
val1="fish"
val2="carrot"
val3="narf"
></sometag></root>

<root><sometag val1="fish" val2="carrot" val3="narf"/></root>

Hopefully this makes it clear why writing a regex- or line-based parser is difficult. Fortunately, you don't need to: most scripting languages have at least one XML parser available, sometimes several.
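
To make the point concrete, here is a small sketch using Python's standard-library parser (my choice for illustration; any real XML parser would do). It reduces each of the four snippets above to its tags and attributes and checks that they all come out the same. The helper name `structure` is hypothetical:

```python
import xml.etree.ElementTree as ET

# The four differently formatted snippets from above.
VARIANTS = [
    """<root>
  <sometag val1="fish" val2="carrot" val3="narf"></sometag>
</root>""",
    """<root>
  <sometag
      val1="fish"
      val2="carrot"
      val3="narf"></sometag>
</root>""",
    """<root
><sometag
val1="fish"
val2="carrot"
val3="narf"
></sometag></root>""",
    """<root><sometag val1="fish" val2="carrot" val3="narf"/></root>""",
]


def structure(xml_text):
    """Reduce a document to (tag, sorted attributes) pairs, ignoring layout."""
    root = ET.fromstring(xml_text)
    return [(el.tag, sorted(el.attrib.items())) for el in root.iter()]


shapes = [structure(v) for v in VARIANTS]
# Every variant yields the same structure once it has been parsed.
print(all(shape == shapes[0] for shape in shapes))  # prints True
```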

As a previous poster has alluded to, xml_grep is available. It's a tool built on the XML::Twig Perl library. It uses XPath expressions to find things, and it differentiates between document structure, attributes and content.

E.g.:

xml_grep 'job' jobs.xml --text_only

However in the interest of making better answers, here's a couple of examples of 'roll your own' based on your source data:

First way:

Use twig handlers that catch elements of a particular type and act on them. The advantage of doing it this way is that it parses the XML 'as you go', and lets you modify it in flight if you need to. This is particularly useful for discarding 'processed' XML when you're working with large files, using purge or flush:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

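XML::Twig->new(
    twig_handlers => {
        # Print the text of each <job> element once it has been parsed.
        'job' => sub { print $_->text }
    }
    # Slurp all input into one string: a bare <> in list context would hand
    # each input line to parse() as a separate argument.
    )->parse( do { local $/; <> } );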

This uses <> to take input (piped in, or named on the command line as ./myscript somefile.xml). For each job element, it extracts and prints the associated text. (You might want print $_ -> text,"\n" to insert a linefeed.)

Because it's matching on 'job' elements, it'll also match on nested job elements:

<job>programming
    <job>anotherjob</job>
</job>

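This will match twice, and some of the output will be printed twice too, because the text of the outer job includes the text of the nested one. You can, however, match on /job instead if you prefer, which selects only the root element. Usefully, this lets you e.g. print and delete an element, or copy and paste one, modifying the XML structure.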
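
The double match is not specific to XML::Twig. As a rough analogue (written in Python's ElementTree rather than Perl, with hypothetical helper names), matching every job element returns the nested text twice, while looking only at the root does not:

```python
import xml.etree.ElementTree as ET

NESTED = """<job>programming
    <job>anotherjob</job>
</job>"""


def all_job_texts(xml_text):
    """Text of every <job>, each including the text of its descendants."""
    root = ET.fromstring(xml_text)
    return ["".join(el.itertext()) for el in root.iter("job")]


def root_job_text(xml_text):
    """Only the root's own leading text, and only if the root is a <job>."""
    root = ET.fromstring(xml_text)
    return root.text if root.tag == "job" else None


# 'anotherjob' appears in both entries: once via the outer job, once via the inner.
print(all_job_texts(NESTED))
print(root_job_text(NESTED).strip())  # prints programming
```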

Alternatively - parse first, and 'print' based on structure:

# Slurp the input so parse() receives the whole document as one string.
my $twig = XML::Twig->new( )->parse( do { local $/; <> } );
print $twig->root->text;

As job is your root element, all we need do is print the text of it.

But we can be a bit more discerning, and look for job or /job and print that specifically instead:

# Slurp the input so parse() receives the whole document as one string.
my $twig = XML::Twig->new( )->parse( do { local $/; <> } );
print $twig->findnodes('/job',0)->text;

You can use XML::Twig's pretty_print option to reformat your XML too:

XML::Twig->new( 'pretty_print' => 'indented_a' )->parse( do { local $/; <> } )->print;

There's a variety of output format options, but for simpler XML (like yours) most will look pretty similar.

Sobrique
8

Just use awk; there's no need for other external tools. The command below also works if your desired tags span multiple lines.

$ cat file
test
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">
programming</job>

$ awk -vRS="</job>" '{gsub(/.*<job.*>/,"");print}' file
programming

programming
ghostdog74
  • ` job>` is valid, but your script doesn't recognize it. `<!-- -->` is a comment that needs to be ignored (and `<![CDATA[ ]]>` is literal data), but your script doesn't know *that*. And then there are cases like having a DTD that defines new macros, such that `&foo;` expands to something locally-specified, and the simple cases like needing to convert `&amp;` to `&`. Trying to roll your own XML parsing (or worse, generation) leads to no end of corner cases and little details that need to be individually run down and fixed. – Charles Duffy Sep 25 '17 at 14:28
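
For comparison, a real parser handles these corner cases without any extra work. A minimal sketch with Python's standard library, using a made-up input that combines an entity, a CDATA section and a comment containing a closing tag (the name `element_text` is hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical input: an escaped ampersand, literal CDATA, and a comment
# that happens to contain "</job>".
TRICKY = """<job xmlns="http://www.sample.com/"
  >Tom &amp; Jerry <![CDATA[<b>not a tag</b>]]><!-- a </job> in a comment --></job>"""


def element_text(xml_text):
    """Return the character data of the root element and its descendants."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


# The entity is decoded, the CDATA is kept literally, the comment is dropped.
print(element_text(TRICKY))  # prints Tom & Jerry <b>not a tag</b>
```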
6

Using sed command:

Example:

$ cat file.xml
<note>
        <to>Tove</to>
                <from>Jani</from>
                <heading>Reminder</heading>
        <body>Don't forget me this weekend!</body>
</note>

$ cat file.xml | sed -ne '/<heading>/s#\s*<[^>]*>\s*##gp'
Reminder

Explanation:

cat file.xml | sed -ne '/<pattern_to_find>/s#\s*<[^>]*>\s*##gp'

n - suppresses automatic printing of every line
e - the next argument is the script

/<pattern_to_find>/ - selects lines that contain the specified pattern, e.g. <heading>

Next is the substitution part, s///p, which removes everything except the desired value. The / delimiter is replaced with # for better readability:

s#\s*<[^>]*>\s*##gp
\s* - matches surrounding whitespace, if any (same at the end)
<[^>]*> - matches an <xml_tag>; it stands in for the non-greedy <.*?>, which sed does not support
g - replaces every match on the line, so the closing </xml_tag> tag is removed as well
p - prints the line after the substitution
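
The same pattern can be tried out in Python's re module, which makes it easy to see both what it does and where it breaks. This is only a sketch of the regex, not of sed itself; `strip_tags` is a made-up name, and the second input is a hypothetical line with a > inside an attribute value (which XML allows):

```python
import re

# Same pattern as in the sed script: a tag plus any whitespace around it.
TAG_WITH_PADDING = re.compile(r"\s*<[^>]*>\s*")


def strip_tags(line):
    """Delete every tag (and adjacent whitespace) from a single line."""
    return TAG_WITH_PADDING.sub("", line)


print(strip_tags("                <heading>Reminder</heading>"))  # prints Reminder

# A '>' inside an attribute value ends the match too early and leaves debris.
print(strip_tags('<heading note="a > b">Reminder</heading>'))  # prints b">Reminder
```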

vldbnc
5

Assuming same line, input from stdin:

sed -ne '/<\/job>/ { s/<[^>]*>\(.*\)<\/job>/\1/; p }'

Notes: -n stops it outputting everything automatically; -e means it's a one-liner (as opposed to a script file); /<\/job>/ acts like a grep; s strips the open tag with its attributes and the end tag; ; starts a new statement; p prints; {} makes the grep apply to both statements, as one.

13ren
0

How about:

cat a.xml | grep '<job' | cut -d '>' -f 2 | cut -d '<' -f 1
codaddict
  • 4
    UUOC. `grep ' – ghostdog74 Feb 08 '10 at 23:53
  • @ghost *but but but, I think it's cleaner / nicer / not that much of a waste / my privilege to waste processes!* http://partmaps.org/era/unix/award.html#cat (actually, I think it's easier to edit the filename, because nearer the start) – 13ren Feb 10 '10 at 12:13
  • 3
    If you use `< a.xml | grep ...` you get it even closer to the start. – Thor Aug 23 '12 at 13:11
0

A bit late to the show.

xmlcutty cuts out nodes from XML:

$ cat file.xml
<?xml version="1.0" encoding="utf-8"?>
<job xmlns="http://www.sample.com/">programming</job>
<job xmlns="http://www.sample.com/">designing</job>
<job xmlns="http://www.sample.com/">managing</job>
<job xmlns="http://www.sample.com/">teaching</job>

The path argument names the path to the element you want to cut out. In this case, since we are not interested in the tags at all, we rename the tag to \n, so we get a nice list:

$ xmlcutty -path /job -rename '\n' file.xml
programming
designing
managing
teaching

Note that the XML was not well-formed to begin with (no single root element). xmlcutty can work with slightly broken XML, too.

miku
0

yourxmlfile.xml

<item> 
  <title>15:54:57 - George:</title>
  <description>Diane DeConn? You saw Diane DeConn!</description> 
</item> 
<item> 
  <title>15:55:17 - Jerry:</title> 
  <description>Something huh?</description>
</item>

grep 'title' yourxmlfile.xml

  <title>15:54:57 - George:</title>
  <title>15:55:17 - Jerry:</title>

grep 'title' yourxmlfile.xml | awk -F">" '{print $2}'

  15:54:57 - George:</title
  15:55:17 - Jerry:</title

grep 'title' yourxmlfile.xml | awk -F">" '{print $2}' | awk -F"<" '{print $1}'

  15:54:57 - George:
  15:55:17 - Jerry:
m.nguyencntt
0

Use xml2 to use line-oriented tools with XML

Example:

xml2 <foo.xml | sed -n 's#.*/job=##p'

Output:

programming

Where to get xml2

The xml2 command often can be installed using your system's package manager (for example, apt install xml2). It can also be downloaded from https://github.com/cryptorick/xml2.

xml2 Documentation

Why xml2 is needed

Naive use of grep, sed, and awk is brittle. Consider the following XML file which would break such solutions:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <job xmlns=
       "http://www.people.com/"
       val1="fish" val2="carrot"
       val3="narf"
       >teaching<!-- A comment about the </job> tag --></job>
</root>

Why not xml_grep and friends

Most of the robust answers to this question suggest using tools, such as xml_grep, which search using the XPath syntax. XPath is designed especially for searching XML documents and is a fine solution if you already know XPath or don't know anything else.

However, if you just need to search XML files and know the standard UNIX tools, it may not be worth your time to learn XPath which has limited utility beyond XML. Fortunately, xml2 provides an easy way to leverage the power of UNIX and regular expressions by converting the XML syntax to a "flat file" format in which each record is on a single line.

Example xml2 output

For example, running xml2 < foo.xml on the following file:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <job xmlns="http://www.sample.com/">programming</job>
  <job xmlns="http://www.supple.com/">designing</job>
  <job xmlns="http://www.simple.com/">managing</job>
  <job xmlns=
       "http://www.people.com/"
       val1="fish" val2="carrot"
       val3="narf"
       >teaching<!-- A comment about the </job> tag --></job>
</root>

would output the following text file:

/root/job/@xmlns=http://www.sample.com/
/root/job=programming
/root/job
/root/job/@xmlns=http://www.supple.com/
/root/job=designing
/root/job
/root/job/@xmlns=http://www.simple.com/
/root/job=managing
/root/job
/root/job/@xmlns=http://www.people.com/
/root/job/@val1=fish
/root/job/@val2=carrot
/root/job/@val3=narf
/root/job=teaching
/root/job/!= A comment about the </job> tag 

As you can see, the peculiarities of the XML file have been normalized and the output can be easily parsed by grep, sed, or awk. In particular, the command xml2 <foo.xml | sed -n 's#.*/job=##p' outputs:

programming
designing
managing
teaching
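
If xml2 is not available, the idea of a flat "path=value" format is easy to approximate. The following is a rough Python sketch of that idea only, not xml2's implementation: the function names are made up, and unlike xml2 it does not emit xmlns attributes, comments, or the separator lines between repeated elements.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the "{namespace}" prefix ElementTree puts on namespaced names."""
    return tag.rsplit("}", 1)[-1]


def flatten(elem, prefix=""):
    """Yield xml2-style "path=value" lines for elem and its descendants."""
    path = prefix + "/" + local_name(elem.tag)
    for name, value in elem.attrib.items():
        yield path + "/@" + local_name(name) + "=" + value
    text = (elem.text or "").strip()
    if text:
        yield path + "=" + text
    for child in elem:
        yield from flatten(child, path)


# A cut-down version of the file above.
SAMPLE = """<root>
  <job xmlns="http://www.sample.com/">programming</job>
  <job
       val1="fish"
       >teaching<!-- A comment about the </job> tag --></job>
</root>"""

for line in flatten(ET.fromstring(SAMPLE)):
    print(line)
```

Each printed line can then be filtered with grep, sed or awk in the same way as real xml2 output.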

Side note: I added a <root> node to make the file valid XML although xml2 works fine either way.

Limitations

While xml2 is very handy for search and replace, if you are going to be doing a lot of work with XML, you'll probably want to learn XPath and XSLT which can perform more powerful hierarchical transformations.

hackerb9