How to extract characters between the delimiters using sed?

Question

I have just started learning sed. I want to extract and print the characters between the > and < delimiters. Here is the text in my data file:

<span id="ctl00_ContentPlaceHolder1_lblRollNo">12029</span>

   <br /><b>Engineering & IT/Computer Science</b><br />

        <div id="ctl00_ContentPlaceHolder1_divEngITMerit">

                        <span id="ctl00_ContentPlaceHolder1_lblEngITSelListNo">3rd Provisional Selection List</span>

                <tr><td style='width: 200px' class='TblTRData'>IT/Computer Science/Software</td><td style='width: 150px'class='TblTRData'>7 (out of 471)</td><td style='width: 325px'class='TblTRData'>Selected in MS COMPUTER SCIENCE</td></tr>

                                Name:

                                <span id="ctl00_ContentPlaceHolder1_lblName">SIDRA SHAHID</span>

                                Father Name:

                                <span id="ctl00_ContentPlaceHolder1_lblFatherName">SHAHID RAFEEQ AHMAD</span>

I have written the command:

sed -n -e '/^[^>]*>\([^<]*\)<.*/s//\1/p' myfile.txt

The problem is that it returns the text between only some of the > < pairs. For example, it prints 12029, but not Selected in MS COMPUTER SCIENCE. What am I doing wrong?

You should use an XML parser instead. What if you have entities in there? – Benoit, Oct 07 '11 at 08:31
I'll just drop this link into the comments in case anyone happens to find it useful: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – johnsyweb, Oct 07 '11 at 09:36

score 1 · Answer 1 · edited Dec 02 '11 at 17:00


If you only need the strings between tags, that amounts to deleting the tags and leaving the text between them untouched, right?

sed 's/<[^>]*>//g'

It replaces every occurrence of a tag (a "<", then everything up to the next ">") with the empty string. The text between the tags remains.
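As a rough sketch of how this plays out on the question's data, the snippet below runs the substitution on two lines copied from the sample file. The whitespace trimming and blank-line removal are extra steps assumed here for tidier output, not part of the answer itself, and the variable names are only illustrative.

```shell
# Two lines copied from the question's data file (leading whitespace kept).
input='<span id="ctl00_ContentPlaceHolder1_lblRollNo">12029</span>
                                <span id="ctl00_ContentPlaceHolder1_lblName">SIDRA SHAHID</span>'

# 1) Delete every "<...>" tag on each line (the command from the answer).
# 2) Trim leading whitespace (an extra step, assumed for readability).
# 3) Drop lines that end up empty (also an extra step).
result=$(printf '%s\n' "$input" |
  sed -e 's/<[^>]*>//g' -e 's/^[[:space:]]*//' -e '/^$/d')

printf '%s\n' "$result"
```

Note that on the table row from the question, the three cells would come out run together on one line, because nothing separates them once the tags are gone.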

edited Dec 02 '11 at 17:00

Jasper


answered Dec 02 '11 at 16:43

user1077830


Benoit · Answer 2 · 2011-10-07T08:59:13.040

0

In sed, the s command has a g flag to operate on all occurrences of the pattern on the same line.

s/>\([^<]*\)</\1/pg

might suffice.
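To see what the g flag changes, here is a minimal sketch on a made-up one-line input (the string `a>1<b>2<c` and the variable names are assumptions for illustration, not taken from the question). The p flag is left out so each command prints its line exactly once.

```shell
# Hypothetical one-line input with two ">...<" pairs.
sample='a>1<b>2<c'

# Without g, only the first match on the line is replaced.
first_only=$(printf '%s\n' "$sample" | sed 's/>\([^<]*\)</\1/')

# With g, every non-overlapping match on the line is replaced.
all_matches=$(printf '%s\n' "$sample" | sed 's/>\([^<]*\)</\1/g')

printf '%s\n%s\n' "$first_only" "$all_matches"
```

This substitution removes only the `>` and `<` characters around each captured run. Everything outside those pairs stays on the line, so on the HTML in the question the tag names and attributes would still be present in the output.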

edited Oct 07 '11 at 08:59

answered Oct 07 '11 at 08:35

Benoit


@mainajaved: and with this regex? – Benoit Oct 07 '11 at 08:59
@mainajaved: Unless your sed script is invoked with the `-n` option, you might try removing the `p` at the end of that command. It means print, so any time you have a successful match the line is printed, which, without the `-n` option, can lead to some confusing output. But more importantly, per the link from johnsyweb and Benoit's original comment, parsing XML with any regex tool will never have any long-term success. If, as you say, you're trying to learn sed, this is **really** not the sort of topic to start learning with. Good luck. – shellter Oct 07 '11 at 13:15
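The point about `p` and `-n` in the comment above can be checked with a small sketch. The input `x` and the variable names are made up for illustration.

```shell
# Without -n: the line is printed once by the p flag and once more
# by sed's default end-of-cycle printing, so it appears twice.
without_n=$(printf 'x\n' | sed 's/x/y/p')

# With -n: default printing is off, so only the p flag prints the line.
with_n=$(printf 'x\n' | sed -n 's/x/y/p')

printf 'without -n:\n%s\nwith -n:\n%s\n' "$without_n" "$with_n"
```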
