60

I'm trying to extract a value from an xml document that has been read into my script as a variable. The original variable, $data, is:

<item> 
  <title>15:54:57 - George:</title>
  <description>Diane DeConn? You saw Diane DeConn!</description> 
</item> 
<item> 
  <title>15:55:17 - Jerry:</title> 
  <description>Something huh?</description>
</item> 

and I wish to extract the first title value, so

15:54:57 - George:

I've been using the sed command:

title=$(sed -n -e 's/.*<title>\(.*\)<\/title>.*/\1/p' <<< $data)

but this only outputs the second title value:

15:55:17 - Jerry:

Does anyone know what I have done wrong? Thanks!

rgamber
Pete
  • The test data you gave is not a valid XML document. What does your _real_ data look like? – Charles Duffy Jun 27 '13 at 03:15
  • No-repro. I get both. – Kevin Jun 27 '13 at 03:24
  • ...to be clearer about what I mean by "not a valid XML document" -- it has no root element, and its tags aren't all closed. This matters; an answer that's intended to handle invalid XML will be different from one that can assume legitimately compliant input. – Charles Duffy Jun 27 '13 at 03:28

3 Answers

106

As Charles Duffy has stated, XML is best parsed with a proper XML parsing tool. For a one-time job, the following should work.

grep -oPm1 "(?<=<title>)[^<]+"

Test:

$ echo "$data"
<item> 
  <title>15:54:57 - George:</title>
  <description>Diane DeConn? You saw Diane DeConn!</description> 
</item> 
<item> 
  <title>15:55:17 - Jerry:</title> 
  <description>Something huh?</description>
$ title=$(grep -oPm1 "(?<=<title>)[^<]+" <<< "$data")
$ echo "$title"
15:54:57 - George:
jaypal singh
  • A `perl` solution would be: `perl -ne 'print and last if s/.*<title>(.*)<\/title>.*/\1/;' <<< "$data"` though as Charles Duffy suggested in comments, XML parsing is best done with XML parsers. This should be good for one-off hacks. – jaypal singh Jun 27 '13 at 03:21
  • Thanks a lot. I tried a bunch of `sed` commands, and eventually got it to work, but then it turned out that it worked differently in different terminals >sigh<. So, `grep` to the rescue :o) I extended it a bit `grep -oPm1 "(?<=)REL.P*[^]+" pom.xml`. This means: Give me the string between `` and `` which begins with `REL.P`, e.g. `REL.P.02.03.04`. – Jonas Bang Christensen Feb 20 '14 at 08:46
  • Telling people to use regular expressions for a clearly non regular language is doomed and should not be the accepted answer – ooxi May 11 '14 at 12:47
  • @ooxi Thanks for leaving a comment. I won't argue with your down vote as I do agree with you. I never said this is the only way. In fact I did state in my comments that XML parsers should be used. However, for a one-time job I don't expect anyone to learn an XML parser. Clearly it worked out for OP. – jaypal singh May 11 '14 at 13:30
  • @JonasBang, perhaps you mean `REL[.]P.*`? `REL.P*` is something quite different. – Charles Duffy May 11 '14 at 13:34
  • @jaypal You are right it would do for a one time job like OP has posted. But since this post is in the top results in google when searching for this topic and thus people will start copy pasting the accepted solution I would be more comfortable with a correct answer which scales beyond the question of OP – ooxi May 11 '14 at 13:38
  • @ooxi Agreed. I updated the answer to issue a warning and embedded a link to the answer from Charles. – jaypal singh May 11 '14 at 13:46
  • There is no "-P" option for grep on Mac. It does work on Linux (tested on Ubuntu 20.04 LTS and Debian 11). Upvoted the "sed" solution from doubleDown below, because it works on both. – user3589608 May 31 '22 at 21:06
35

XMLStarlet or another XPath engine is the correct tool for this job.

For instance, with data.xml containing the following:

<root>
  <item> 
    <title>15:54:57 - George:</title>
    <description>Diane DeConn? You saw Diane DeConn!</description> 
  </item> 
  <item> 
    <title>15:55:17 - Jerry:</title> 
    <description>Something huh?</description>
  </item>
</root>

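...you can extract only the first title with the following (the parentheses matter: //title[1] selects the first title child of every parent element, which here would be both titles):

xmlstarlet sel -t -m '(//title)[1]' -v . -n <data.xml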

Trying to use sed for this job is troublesome. For instance, the regex-based approaches won't work if the title has attributes; won't handle CDATA sections; won't correctly recognize namespace mappings; can't determine whether a portion of the XML document is commented out; won't unescape entity references (such as changing Brewster &amp; Jobs to Brewster & Jobs), and so forth.
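To make the first of these failure modes concrete, here is a minimal sketch, assuming a POSIX shell and a hypothetical lang attribute that is not part of the original data. The sed expression from the question only matches the literal text <title>, so it finds nothing once the tag carries an attribute:

```shell
#!/bin/sh
# Hypothetical input: the first item from the question, but the title
# tag now has an attribute (lang="en" is made up for illustration).
data='<item>
  <title lang="en">15:54:57 - George:</title>
  <description>Diane DeConn? You saw Diane DeConn!</description>
</item>'

# The regex from the question. printf stands in for the bash here-string
# so that the sketch also runs under plain sh.
title=$(printf '%s\n' "$data" | sed -n -e 's/.*<title>\(.*\)<\/title>.*/\1/p')

# The pattern never matches "<title lang=...>", so title stays empty.
printf 'title=[%s]\n' "$title"   # prints: title=[]
```

An XPath query for the title element is unaffected by such an attribute, which is the kind of robustness the regex cannot offer.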

Charles Duffy
  • Thanks Charles, I wanted the script to be portable so I didn't think xmlstarlet would be best since I'd need to install the package on each system that I would want to use it on. – Pete Jun 27 '13 at 04:00
  • @Pete This is a case where you need to decide between portability and correctness; you can't have both. The answer you accepted will give demonstrably wrong output in some common situations (and a very large number of uncommon ones). – Charles Duffy Jun 27 '13 at 04:02
  • Good tool, a little complicated to use but after 30 min of `man` and testing I got what I needed. thanks! – JorgeeFG Oct 28 '13 at 19:34
15

I agree with Charles Duffy that a proper XML parser is the right way to go.

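As for what's wrong with your sed command (or did you do it on purpose?):

  • $data was not quoted, so it is subject to the shell's word splitting and filename expansion, among other things. One consequence is that the whitespace in the XML snippet, including the line breaks, is not preserved. If sed then receives the whole snippet as a single line, the greedy .* in front of <title> runs ahead to the last <title> on that line, which would explain why only the second title is printed.

So, given your specific XML structure, this modified sed command should work: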

title=$(sed -ne '/title/{s/.*<title>\(.*\)<\/title>.*/\1/p;q;}' <<< "$data")

Basically, for the first line that contains title, extract the text between the tags, then quit (so you don't extract the second <title>).
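Putting the pieces together, a minimal self-contained sketch, assuming a POSIX shell. The printf pipe is a stand-in for the bash here-string so that it also runs under plain sh; the sed program itself is unchanged:

```shell
#!/bin/sh
# Sample data from the question, kept in a variable.
data='<item>
  <title>15:54:57 - George:</title>
  <description>Diane DeConn? You saw Diane DeConn!</description>
</item>
<item>
  <title>15:55:17 - Jerry:</title>
  <description>Something huh?</description>
</item>'

# Quoting "$data" preserves the line breaks, so sed sees one tag per line.
# On the first line containing "title": print the text between the tags,
# then quit so that the second <title> is never reached.
title=$(printf '%s\n' "$data" |
  sed -ne '/title/{s/.*<title>\(.*\)<\/title>.*/\1/p;q;}')

printf '%s\n' "$title"   # prints: 15:54:57 - George:
```

This still relies on each title sitting on its own line with no attributes, so it fits this snippet rather than XML in general.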

kjhughes
doubleDown
  • I added a semicolon after 'q' in sed command. This still works with Linux distros and resolves issue on MacOS where error otherwise occurs: *"extra characters at the end of q command"* – kjhughes Jan 22 '15 at 17:42
  • Thank you! Needed this to work on both, Mac and Linux. And your solution fit the bill. Top-rated solution from @jaypal singh did not work for me, because there is no "-P" option for grep on MAC. – user3589608 May 31 '22 at 21:08