This looks like it might be an XML question, or maybe HTML that is “not quite XML”, in which case there are utilities that enable you to extract particular parts of the document according to XPath. If you can install software, you might try:
xgrep -x //title <your file>
There are dozens of little utilities like this of varying degrees of maturity and ability to handle quirks (like parsing HTML that is not well-formed XML).
If you really have to fall back on doing this with regular expressions, assuming that your file is called tagsoup.in, and looks something like this:
<blah>
<title>One line title</title>
<p>foo</p>
<p>bar</p>
<title>Multi
line
title
</title>
<p>foo</p>
<p>bar</p>
</blah>
Then the following line of sed will extract the one-line title, but not the multiline title:
sed -n 's/<title>\([^<]\+\)<\/title>/\1/p' tagsoup.in
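One portability note on that one-liner: \+ is a GNU extension to basic regular expressions, so some other sed implementations may not accept it. A minimal sketch of the same expression wrapped in a shell function, with \+ spelled out as [^<][^<]* (the function name one_line_titles is made up for illustration):

```shell
# Hypothetical helper name. Same expression as the one-liner, but the
# GNU-only \+ is written as [^<][^<]* (one or more characters that are not <).
one_line_titles() {
    sed -n 's/<title>\([^<][^<]*\)<\/title>/\1/p' "$1"
}
```

Called as one_line_titles tagsoup.in on the sample file, this prints One line title and still skips the multiline title.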
The following sed script will extract both single-line and multiline titles, but runs the risk of loading the whole file into memory if the end tag is never found:
sed -n '
/<title>\(.*\)/ {               # If the line matches the start tag:
    s//\1/                      # Keep stuff after the start tag
    /<\/title>/!{               # If the end tag is *NOT* on this line
        h                       # Save to hold space
        : loop                  # Start of the read-ahead loop
        n                       # Go on to the next line
        /<\/title>/!{           # If we do NOT match the end tag
            H                   # Append this line to hold space
            b loop              # Go back and try the next line
        }
        /\(.*\)<\/title>/{      # Otherwise the end tag is on this line
            s//\1/              # Keep stuff up to the end tag
            H                   # Append to hold space
            g                   # Fetch hold space to pattern space
            s/\n/ /g            # Replace newlines with spaces
            p                   # Print out pattern space
        }
    }
    /\(.*\)<\/title>/{          # If the end tag *IS* on this line
        s//\1/                  # Keep stuff before the end tag
        p                       # Print the one-line title
    }
}' tagsoup.in
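If the unbounded buffering is a concern, one way to limit it is to stop collecting a title after a fixed number of lines. The following is only a sketch under stated assumptions, and it swaps sed for awk because counting lines is awkward in sed. The function name extract_titles, the optional second argument, and the default limit of 50 lines are all made up for illustration. It also assumes at most one title per line.

```shell
# Hypothetical helper: print each <title>...</title>, joining multiline
# titles with spaces. A title with no end tag within max_lines lines
# (assumed default: 50) is dropped instead of buffered indefinitely.
extract_titles() {
    awk -v max_lines="${2:-50}" '
        !in_title && /<title>/ {            # start tag found
            sub(/.*<title>/, "")            # keep the text after the start tag
            buf = ""; count = 0; in_title = 1
        }
        in_title {
            count++
            closed = sub(/<\/title>.*/, "") # 1 if the end tag was on this line
            if (length($0) > 0)             # skip empty fragments
                buf = (buf == "" ? $0 : buf " " $0)
            if (closed) {
                print buf
                in_title = 0
            } else if (count >= max_lines) {
                in_title = 0                # give up on this title
            }
        }
    ' "$1"
}
```

Called as extract_titles tagsoup.in on the sample file, this prints One line title and then Multi line title. With a limit of 2, as in extract_titles tagsoup.in 2, the multiline title is abandoned and only One line title is printed.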