1

What command can you use to find patterns in a block of text in Unix? I need to find what appears between <title> and </title> (which appears several times in my block of text). I tried using

sed -n '/<title>/,/<\/title>/p'

but it seems to print everything between the first instance of <title> and the last instance of </title>.

Emmet
boop

2 Answers

2

This looks like it might be an XML question, or maybe HTML that is “not quite XML”, in which case there are utilities that enable you to extract particular parts of the document according to XPath. If you can install software, you might try:

xgrep -x //title <your file>

There are dozens of little utilities like this of varying degrees of maturity and ability to handle quirks (like parsing HTML that is not well-formed XML).

If you really have to fall back on doing this with regular expressions, assuming that your file is called tagsoup.in, and looks something like this:

<blah>
  <title>One line title</title>
  <p>foo</p>
  <p>bar</p>
  <title>Multi
line
title
  </title>
  <p>foo</p>
  <p>bar</p>
</blah>

Then the following line of sed will extract the one-line title, but not the multiline title:

sed -n 's/<title>\([^<]\+\)<\/title>/\1/p' tagsoup.in
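
If several titles can share one line, the substitution above rewrites only the first one and keeps the rest of the line. A possible alternative, sketched here under the assumption that your grep supports -o (GNU and BSD grep do; POSIX does not require it) and that titles contain no nested tags; same_line_demo.in is a made-up file name:

```shell
# Hypothetical sample with two titles on one line.
cat > same_line_demo.in <<'EOF'
<blah>
  <title>First title</title><p>foo</p><title>Second title</title>
  <title>Third title</title>
</blah>
EOF

# grep -o prints every match on its own line;
# sed then strips the surrounding tags from each match.
titles=$(grep -o '<title>[^<]*</title>' same_line_demo.in | sed 's/<[^>]*>//g')
printf '%s\n' "$titles"
```

This only handles titles that start and end on the same line.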

The following sed will extract single-line and multiline content, but runs the risk of loading the whole file into memory if the end tag is not found:

sed -n '
/<title>\(.*\)/ {           # If the line matches the start tag:
    s//\1/                  #   Keep stuff after the start tag
    /<\/title>/!{           #   If the end-tag is *NOT* on this line
        h                   #     Save to hold space
        : loop              #     
        n                   #     Go on to the next line
        /\(.*\)<\/title>/{  #     If we match the end tag
            s//\1/          #       Keep stuff up to the end tag
            H               #       Append to hold space
            g               #       Fetch hold space to pattern space
            s/\n/ /g        #       Replace newlines with spaces
            p               #       Print out pattern space
            d               #       Done with this title; start the next cycle
        }
        /<\/title>/!{       #     If we do NOT match the end tag
            H               #       Append this line to hold space
            b loop          #       Go back and try the next line
        }
    }    
    /\(.*\)<\/title>/{      # If the end-tag *IS* on this line
        s//\1/              #   Keep stuff before the end tag
        p                   #   Print the one-line title
    }
}' tagsoup.in
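
For comparison, the same line-by-line idea can be sketched in awk, which some people find easier to read than hold-space juggling. This is only a rough sketch: it assumes at most one title starts per line, the variable names (collecting, buf) and the file name multiline_demo.in are made up for illustration, and it is no more of a real XML parser than the sed above.

```shell
# Hypothetical copy of the sample input shown earlier.
cat > multiline_demo.in <<'EOF'
<blah>
  <title>One line title</title>
  <p>foo</p>
  <title>Multi
line
title
  </title>
  <p>bar</p>
</blah>
EOF

titles=$(awk '
  /<title>/ { collecting = 1; buf = "" }      # start tag seen: begin a new buffer
  collecting {
      buf = (buf == "" ? $0 : buf " " $0)     # join lines with single spaces
      if (buf ~ /<\/title>/) {                # end tag reached
          sub(/.*<title>/, "", buf)           # drop everything up to the start tag
          sub(/<\/title>.*/, "", buf)         # drop the end tag and what follows
          sub(/[ \t]+$/, "", buf)             # trim trailing blanks
          print buf
          collecting = 0
      }
  }' multiline_demo.in)
printf '%s\n' "$titles"
```

Like the sed version, it keeps buffering to the end of the file if an end tag is missing.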
Emmet
  • @Emmett, yeah I need to do it with regular expressions or some other command that I can automatically use in Unix. – boop Mar 10 '14 at 18:57
  • @boop: do the begin and end tags always appear on the same line, or can they appear on different lines? – Emmet Mar 10 '14 at 19:05
  • @Emmett, I think they are always on the same line. – boop Mar 10 '14 at 19:06
  • @boop: I've added a line of sed for that case. – Emmet Mar 10 '14 at 19:10
  • @Emmett: It seems to cause nothing to print :( – boop Mar 10 '14 at 19:14
  • @boop: Can you provide an example that represents what you have in your file? – Emmet Mar 10 '14 at 19:21
  • +1 for exceptionally detailed answer AND tech support to boot. There is a question already on S.O. that has a hysterically funny answer about why you shouldn't use reg-exp for XML, but I don't have a link handy. Others will likely supply it. Good luck to all. – shellter Mar 10 '14 at 20:14
  • @shellter: Thank you. I agree in general, but there are rare circumstances where you may not have the luxury of using *xmlstarlet* or whatever. – Emmet Mar 10 '14 at 20:25
  • I will be referring to your answer when I'm asked to do some XML munging. It very likely will save me having to get software installed to our prod environment. Thanks for sharing. – shellter Mar 10 '14 at 21:16
  • I would add `xml` and `sed` as tags to this question so people can find your well explained solution. Good luck to all. – shellter Mar 10 '14 at 23:24
  • @shelter : http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Raul Andres Mar 11 '14 at 14:18
0

Works on single-line and multiline titles (with GNU sed you will probably need to add -e)

sed -n '1h;1!H;${x
   s/<title>/²/g;s|</title>|³|g
: again
   s/[^²]*²\([^³]*\)³/\1³/
   t print
   b
: print
   h;s/³.*//
i\
++ Title:
   p
   g;s/[^³]*³//
   t again
   }' YourFile

using

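  • Delimiters (² and ³, but any other unused character is OK) as a workaround for the "non text block" regex limitation: a negated bracket expression can exclude single characters, not a whole string such as </title>.
  • An iterative process to extract every title.
  • The whole file must first be loaded into the buffer (1h;1!H;${x).
  • I just added an output separator (i\ ++ Title:).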
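
The whole-file buffering step (1h;1!H;${x) can be tried on its own before adding the rest. A minimal sketch, assuming a small made-up file slurp_demo.in; it only joins the lines, so you can see that a title split across lines becomes a single string that one regex can match:

```shell
# Hypothetical two-line input with a title split across lines.
cat > slurp_demo.in <<'EOF'
<title>Multi
line</title>
EOF

# 1h   : copy the first line into the hold space
# 1!H  : append every later line to the hold space
# ${x  : on the last line, swap so the pattern space holds the whole file
flat=$(sed -n '1h;1!H;${x;s/\n/ /g;p;}' slurp_demo.in)
printf '%s\n' "$flat"
```

Because everything is held in memory, this suits small inputs better than very large files.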
NeronLeVelu