I have this text file (it's really part of an HTML page):

<tr>
              <td width="10%" valign="top"><P>Name:</P></td>
              <td colspan="2"><P>
                XXXXX
              </P></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>City:</p></td>
              <td colspan="2"><p>
                Mycity
              </p></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>County:</p></td>
              <td colspan="2"><p>
                YYYYYY
              </p></td>
            </tr>
            <tr>
              <td width="10%" valign="top"><p>Map:</p></td>
              <td colspan="2"><p>
                ZZZZZZZZ

I've used this sed command to extract "Mycity":

$ tr -d '\n' < file.html | sed -n 's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'

As far as I know the regular expression works, but I get

Map:

instead of Mycity.

I've tested the regex with Rubular and it works there, but not with sed. Is sed not the right tool? What am I doing wrong?

PS: I'm using Linux

PerseP
  • Don't you need to escape `<` & `>` characters? – jcuenod May 23 '15 at 13:33
  • Also, why don't you parse it as xml? – jcuenod May 23 '15 at 13:34
  • I don't think `<` and `>` need to be escaped and I don't know xml. – PerseP May 23 '15 at 13:45
  • There is no such thing as a regexp, there is only a regexp in the context of whatever tool you are using so some tool that tests regexps is only marginally useful. There are BREs and EREs and PCREs and then you have to worry about the regexp or other delimiters the tool uses (e.g. `/` or `'`) and other things. So you can't think a regexp "works" just because some online (or other) tool thinks it's valid regexp syntax in some context. You would only need to escape `<` or `>` chars if you were using them as delimiters for your script or regexp, which you aren't. – Ed Morton May 24 '15 at 12:23
  • @jcuenod can you recommend a tool for the parsing as xml? – PerseP May 27 '15 at 20:26
  • http://stackoverflow.com/a/9204139/123415 – jcuenod May 27 '15 at 20:39

2 Answers

The problem you have right now is that regex quantifiers are greedy by default:

's/.*City:<\/p><\/td>.*<p>\(.*\)<\/p><\/td>.*/\1/p'
                     ^ // here!

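So it matches everything up to the last `<p>...</p></td>` pair in the input. Engines with lazy quantifiers (Ruby's, which Rubular uses, or Perl's) let you make a quantifier non-greedy by appending a `?`, as in the expression below. Note that sed's POSIX BRE/ERE syntax has no lazy quantifiers, so in sed itself the usual substitute is a negated bracket expression such as `[^<]*`.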

's/.*City:<\/p><\/td>.*?<p>\(.*\)<\/p><\/td>.*/\1/p'
                       ^
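
Since sed's regex syntax has no lazy quantifier, one way to get the same effect is to replace each greedy `.*` with a negated bracket expression that cannot run past the next tag. The following is only a sketch: the sample file is a trimmed copy of the excerpt from the question written to a temporary path, and the variable names `sample` and `city` are made up for illustration.

```shell
# Recreate a small sample shaped like the excerpt in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr>
  <td width="10%" valign="top"><p>City:</p></td>
  <td colspan="2"><p>
    Mycity
  </p></td>
</tr>
<tr>
  <td width="10%" valign="top"><p>County:</p></td>
  <td colspan="2"><p>
    YYYYYY
  </p></td>
</tr>
<tr>
  <td width="10%" valign="top"><p>Map:</p></td>
  <td colspan="2"><p>
    ZZZZZZZZ
EOF

# [^<]* and [^>]* stop at the next tag boundary, so the match cannot
# slide forward to the County or Map rows. The capture group must end
# on a non-blank character, which drops the padding around the value.
city=$(tr -d '\n' < "$sample" |
  sed -n 's/.*City:<\/p><\/td>[^<]*<td[^>]*><p>[[:space:]]*\([^<]*[^<[:space:]]\)[[:space:]]*<\/p><\/td>.*/\1/p')

echo "$city"
rm -f "$sample"
```

This relies on the value cell containing no nested tags; if it can, a real HTML/XML parser is the safer choice.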
jcuenod

sed is always the wrong tool for anything that involves processing multiple lines. Just use awk, it's what it was invented to do:

$ awk 'c&&!--c; /City:/{c=2}' file.html
                Mycity

See Printing with sed or awk a line following a matching pattern
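
For readers who find the one-liner cryptic, here is the same countdown idea spelled out, as a sketch rather than a replacement. The name `countdown` stands in for `c`, the whitespace trimming is an addition that the original command does not do, and the sample file is a trimmed copy of the question's excerpt written to a temporary path.

```shell
# Recreate a small sample shaped like the excerpt in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr>
  <td width="10%" valign="top"><p>City:</p></td>
  <td colspan="2"><p>
    Mycity
  </p></td>
</tr>
<tr>
  <td width="10%" valign="top"><p>County:</p></td>
  <td colspan="2"><p>
    YYYYYY
  </p></td>
</tr>
EOF

city=$(awk '
  # While a countdown is running, decrement it on every line and
  # act on the line where it reaches zero.
  countdown > 0 && --countdown == 0 {
    gsub(/^[ \t]+|[ \t]+$/, "")   # trim leading and trailing blanks
    print
  }
  # The value sits two lines below the label, so start at 2.
  /City:/ { countdown = 2 }
' "$sample")

echo "$city"
rm -f "$sample"
```

The rule order matters: the countdown rule comes first so that the label line itself is not counted.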

Ed Morton
  • It just looks so weird – PerseP May 27 '15 at 19:27
  • That particular construct does but you wouldn't normally write awk code that's so non-obvious, it's just that that is one of a particular set of awk idioms whose brevity and consistency makes them useful despite their lack of clarity. Otherwise awk is just a subset of C with associative arrays, and an implicit `while read field1 field2 ... fieldN` loop and a couple of additional features to make text processing easy. Awk is THE general purpose UNIX text processing tool so everyone benefits from knowing it. Get the book Effective Awk Programming, 4th Edition, by Arnold Robbins. – Ed Morton May 27 '15 at 21:58