2

Here's my input:

<array>
    <string>extra1</string>
    <string>extra2</string>
    <string>Yellow
5</string>

Note: there's a space and newline between "Yellow" and "5"

I am piping this to sed:

| sed -n 's#.*<string>\(.*\)</string>#\1#p'

and I am getting the output:

extra1
extra2

I know that, because sed strips the newline from the end of each input line, the newline is not there to be matched - so that accounts for the result. I have read articles on adding the next line from the buffer, but I can't work out what I need to use in the pattern match to get this to work.

The output I want is:

extra1
extra2
Yellow 5

(In case it makes a difference, I am using a Mac, so I need this to work with - I think - the FreeBSD variant of sed.)

Of course, if another tool is better for what I want to achieve I am open to suggestions! Thanks!

Cyrus
Lorccan
  • It looks like you're trying to parse (x)html with regular expressions. [Please don't do it.](http://stackoverflow.com/a/1732454/237955) – amphetamachine Feb 08 '16 at 20:41
  • Actually I am not. This is (partial) output from a process that decodes a binary plist and gives ASCII XML – Lorccan Feb 08 '16 at 20:56
  • Possible duplicate of [Match a string that contains a newline using sed](https://stackoverflow.com/q/23850789/608639) – jww Aug 20 '18 at 03:21

6 Answers

4

Join the lines and tear them apart:

tr -d "\n" < file | grep -o "<string>[^<]*</string>" | sed 's/<string>\(.*\)<\/string>/\1/'
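
To see why this works, it can help to look at the three stages separately. The following is only a sketch: `input.xml` and the function name `extract_strings` are made up for illustration, and it assumes a `grep` that supports `-o` (both GNU and BSD grep do).

```shell
# Recreate the sample input from the question (input.xml is a hypothetical name).
# Note the space after "Yellow", as described in the question.
printf '%s\n' \
    '<array>' \
    '    <string>extra1</string>' \
    '    <string>extra2</string>' \
    '    <string>Yellow ' \
    '5</string>' > input.xml

extract_strings() {
    # Stage 1: delete every newline, so the whole file becomes one long line
    #          and each <string>...</string> pair can no longer be split.
    # Stage 2: grep -o prints each matching pair on a line of its own.
    # Stage 3: sed keeps only the text between the tags.
    tr -d '\n' < "$1" |
        grep -o '<string>[^<]*</string>' |
        sed 's/<string>\(.*\)<\/string>/\1/'
}

extract_strings input.xml
```

Because `tr` removes the newline but leaves the space in front of it, the third value comes out as `Yellow 5`. If the values could themselves contain `<`, the `[^<]*` pattern would need rethinking.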
Walter A
  • Thank you. This does what I want - and, importantly, I understand it! – Lorccan Feb 08 '16 at 22:53
  • You can get the same effect by setting `IFS` to **not** include the carriage return. Something like `IFStemp=$IFS; IFS=$' \t'; sed...; IFS=$IFStemp` – SaxDaddy Feb 08 '16 at 23:09
  • @SaxDaddy: Interesting! I tried `IFS=$' \t'; printf "%s\n%s\n" "first line" "second line" | sed 's/line/part/'` but I just got two lines. What do I do wrong? – Walter A Feb 09 '16 at 20:02
  • your first `\n` added a CR between the two lines. you probably want `printf "%s %s\n"` – SaxDaddy Feb 09 '16 at 21:34
  • @SaxDaddy I used the CR since the main issue is how to delete the CR between `<string>` and `</string>`. – Walter A Feb 09 '16 at 21:42
3

Close your array tag and try this with xmlstarlet and GNU sed:

xmlstarlet sel -t -v "//array/string" input.xml | sed '/ $/{:a;N;s/\n//;ta}'

Or only with xmlstarlet:

xmlstarlet sel -t --match '//array/string' --value-of 'normalize-space()' -n input.xml

Output:

extra1
extra2
Yellow 5
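
If neither xmlstarlet nor GNU sed is available, the same line-joining idea can be sketched with POSIX sed constructs only, working on the raw input rather than on xmlstarlet output. This is an illustration under assumptions, not a general XML parser: it assumes at most one `<string>` element starts on any line, and `input.xml` and `join_strings` are made-up names.

```shell
# Recreate the sample input from the question (input.xml is a hypothetical name).
printf '%s\n' \
    '<array>' \
    '    <string>extra1</string>' \
    '    <string>extra2</string>' \
    '    <string>Yellow ' \
    '5</string>' > input.xml

join_strings() {
    sed -n '
    /<string>/ {
        :a
        # No closing tag yet: append the next input line and check again.
        /<\/string>/! {
            N
            ba
        }
        # Drop the newlines that N embedded in the pattern space.
        s/\n//g
        # Print only the text between the tags.
        s#.*<string>\(.*\)</string>.*#\1#p
    }' "$1"
}

join_strings input.xml
```

The script is spread over several lines because BSD sed is stricter than GNU sed about what may follow a label or a closing brace on the same line.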
Cyrus
2

Any time you start talking about "buffers" or "hold space" or sed constructs other than s, g, and p (with -n) you're simply using the wrong tool. All of that stuff for sed became obsolete in the mid-1970s when awk was invented so just use awk. Here's one way with GNU awk for multi-char RS:

$ awk -v RS='</?string>' '!(NR%2){gsub(/\n/," "); print}' file
extra1
extra2
Yellow 5

The above just prints whatever's between <string> and </string> after converting any newlines to blank chars.

With other awks one way would be:

$ cat tst.awk
{ rec = (rec=="" ? "" : rec " ") $0 }
END {
    split(rec,f,"</?string>")
    for (i=2;i in f;i+=2) {
        print f[i]
    }
}

$ awk -f tst.awk file
extra1
extra2
Yellow 5
Ed Morton
  • I take your point about the mid-1970s. I'll have a better look at awk. – Lorccan Feb 08 '16 at 22:55
  • It will be well worth your while as we see a lot of people asking for sed answers and, unfortunately for them, getting them so it'd be good to know which sed constructs are still extremely useful and which are just a mental exercise/attempt to turn lead to gold! The book Effective Awk Programming, 4th Edition, by Arnold Robbins is your best starting point. – Ed Morton Feb 08 '16 at 23:03
  • I will. The biggest issue I have (apart from ignorance) is that I have too many ideas flowing at me (mostly from here) and insufficient time to do justice to researching them all properly! – Lorccan Feb 08 '16 at 23:19
1

perl is available on OSX by default so you can use:

perl -0ne 's#<string>([^<]*)</string>#sub{$x=$1;$x=~tr/\n/ /;print $x."\n";}->()#eg' file.xml
extra1
extra2
Yellow 5

Alternatively you can install GNU awk (gawk) using Homebrew and use:

awk -v RS= -v FPAT='<string>([^<]*)</string>' '{for(i=1; i<=NF; i++) {
   gsub(/<\/?string>/, "", $i); gsub(/\n/, " ", $i); print $i}}' file.xml
extra1
extra2
Yellow 5
anubhava
0

You can approach this problem with xmllint. I modified your example slightly so that you can see what's going on.

test.xml

<array>
  <string1>extra1</string1>
  <string2>extra2</string2>
  <string3>Yellow
5</string3>
</array>

Since you want the string with the line break, I made this value unique. Now use xmllint and sed to get your results

[saxdaddy ~]$  x="$(xmllint --xpath "/array/string3" test.xml | sed '/^\/ >/d' | sed 's/<[^>]*.//g')"
[saxdaddy ~]$  echo $x
Yellow 5

xmllint's xpath feature will search the XML in dictionary manner. sed will then strip out the beginning and ending tags. The "trick" to this is using quotes to capture the variable and then not using quotes to echo the result.
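
The quoting "trick" can be seen in isolation. A small sketch, assuming bash or another POSIX-style shell (zsh does not split unquoted variables by default, so it behaves differently); the variable names are only for illustration:

```shell
# A value with a space and a newline in the middle, like the one in the question.
x="$(printf 'Yellow \n5')"

# Quoted: the shell passes the value through unchanged, newline included.
quoted=$(echo "$x")

# Unquoted: the shell splits the value on whitespace (spaces, tabs, newlines)
# into separate words, and echo joins those words with single spaces.
unquoted=$(echo $x)

printf '[%s]\n' "$quoted" "$unquoted"
```

Note that this also collapses any run of spaces inside the value to a single space, which may or may not be what you want.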

If your target tag is not unique in the file path, then you can craft a for loop to look for $'\n' (a line break) and set that to your variable.

SaxDaddy
0

Please use a tool like xidel that's designed to parse XML:

xidel -s input.xml -e '//string/normalize-space(.)'
extra1
extra2
Yellow 5
Reino