Extract text with sed

Question

I have this text file (it's really a part of an html):

<tr>
              <td width="10%" valign="top"><P>Name:</P></td>
              <td colspan="2"><P>
                XXXXX
              </P></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>City:</p></td>
              <td colspan="2"><p>
                Mycity
              </p></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>County:</p></td>
              <td colspan="2"><p>
                YYYYYY
              </p></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>Map:</p></td>
              <td colspan="2"><p>
                ZZZZZZZZ

I've used this sed command to extract "Mycity"

$ tr -d '\n' < file.html | sed -n 's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'

The regular expression as far as I know works but I get

Map:

Instead of Mycity.

I've tested the regex with Rubular and it works there, but not with sed. Is sed not the right tool? What am I doing wrong?

PS: I'm using Linux

I don't think `<` and `>` need to be escaped and I don't know xml. — PerseP, May 23 '15 at 13:45
There is no such thing as a regexp, there is only a regexp in the context of whatever tool you are using so some tool that tests regexps is only marginally useful. There are BREs and EREs and PCREs and then you have to worry about the regexp or other delimiters the tool uses (e.g. `/` or `'`) and other things. So you can't think a regexp "works" just because some online (or other) tool thinks it's valid regexp syntax in some context. You would only need to escape `<` or `>` chars if you were using them as delimiters for your script or regexp, which you aren't. — Ed Morton, May 24 '15 at 12:23
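To make the point about dialects concrete, here is a small sketch (not from the original thread) that runs the same pattern text through sed in its default BRE mode and in ERE mode via `-E`. The sample input string is made up for illustration.

```shell
# BRE (sed's default): + is an ordinary character, so only the literal "a+b" matches.
bre=$(printf 'aab a+b\n' | sed 's/a+b/X/')

# ERE (sed -E): + means "one or more of the preceding item", so "aab" matches first.
ere=$(printf 'aab a+b\n' | sed -E 's/a+b/X/')

echo "BRE: $bre"
echo "ERE: $ere"
```

The same characters select different text depending on the dialect, which is why a pattern that passes in an online tester can still behave differently in sed.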

score 2 · Accepted Answer · answered May 23 '15 at 13:48


The problem that you have right now is that regex is greedy by default

's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'
                     ^ // here!

So it's matching everything up to the last section. To be non-greedy use a ?

's/.*City:<\/p><\/td>.*?<p>\(.*\)<\/p><\/td>.*/\1/p'
                       ^


jcuenod


Yes, that almost made it work, but I also had to add the \s* to ignore some blank spaces before and after "Mycity", so the resulting regex is `'s/.*City:<\/p><\/td>.*?\(.*\)<\/p><\/td>.*/\1/p'` – PerseP May 23 '15 at 14:18
sed doesn't support non-greedy operators so this will not help. – Ed Morton May 24 '15 at 12:31
@EdMorton interesting, I didn't realise that. Why is it accepted answer though? – jcuenod May 24 '15 at 20:11
@jcuenod Maybe the OP switched to perl? – Ed Morton May 24 '15 at 21:49
It did work in Rubular, but not when I tested it on my dev machine, because the text there had accented characters – PerseP May 27 '15 at 08:37
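Since sed has no non-greedy operators, one possible workaround (a sketch, not something proposed in the thread) is to replace each `.*` between tags with a negated bracket expression such as `[^<]*`, which cannot run past the next tag. The temporary file and the `city` variable below are illustrative names, and the sample data is a fragment of the HTML from the question.

```shell
# Recreate a fragment of the question's HTML in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<tr>
  <td width="10%" valign="top"><p>City:</p></td>
  <td colspan="2"><p>
    Mycity
  </p></td>
</tr>
<tr>
  <td width="10%" valign="top"><p>County:</p></td>
  <td colspan="2"><p>
    YYYYYY
  </p></td>
</tr>
<tr>
  <td width="10%" valign="top"><p>Map:</p></td>
  <td colspan="2"><p>
    ZZZZZZZZ
EOF

# [^<]* and [^>]* stop at the next tag boundary, so the match stays inside the City row.
# The capture group ends on a non-space character, which trims the trailing blanks.
city=$(tr -d '\n' < "$tmp" |
  sed -n 's/.*City:<\/p><\/td>[^<]*<td[^>]*><p>[[:space:]]*\([^<]*[^<[:space:]]\)[[:space:]]*<\/p>.*/\1/p')

echo "$city"
rm -f "$tmp"
```

This assumes the value itself contains no `<` character and that "City:" appears only once in the input.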

score 2 · Answer 2 · edited May 23 '17 at 11:51


sed is always the wrong tool for anything that involves processing multiple lines. Just use awk, it's what it was invented to do:

$ awk 'c&&!--c; /City:/{c=2}' file.html
                Mycity

See Printing with sed or awk a line following a matching pattern
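For readers unfamiliar with the idiom, the one-liner above can be spelled out as a countdown. The following is an expanded sketch of the same logic; the whitespace trimming and the temporary sample file are additions for illustration and are not part of the original command.

```shell
# Sample input: the City row from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<td width="10%" valign="top"><p>City:</p></td>
<td colspan="2"><p>
    Mycity
</p></td>
EOF

city=$(awk '
  c > 0 {                                # a countdown is running
    c--
    if (c == 0) {                        # this is the second line after the match
      gsub(/^[ \t]+|[ \t]+$/, "")        # trim blanks (not in the original one-liner)
      print
    }
  }
  /City:/ { c = 2 }                      # start counting at the matching line
' "$tmp")

echo "$city"
rm -f "$tmp"
```

Changing `c = 2` to another number selects a different line after the match.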


answered May 24 '15 at 12:35

Ed Morton


It just looks so weird – PerseP May 27 '15 at 19:27

That particular construct does, but you wouldn't normally write awk code that's so non-obvious. It's just one of a particular set of awk idioms whose brevity and consistency make them useful despite their lack of clarity. Otherwise awk is just a subset of C with associative arrays, an implicit `while read field1 field2 ... fieldN` loop, and a couple of additional features to make text processing easy. Awk is THE general purpose UNIX text processing tool, so everyone benefits from knowing it. Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins. – Ed Morton May 27 '15 at 21:58
