sed to match pattern across a newline

Question

Here's my input:

<array>
    <string>extra1</string>
    <string>extra2</string>
    <string>Yellow
5</string>

Note: there's a space and newline between "Yellow" and "5"

I am piping this to sed:

| sed -n 's#.*<string>\(.*\)</string>#\1#p'

and I am getting the output:

extra1
extra2

I know that, because sed strips the newline from the end of each input line, the newline is not there to be matched - so that accounts for the result. I have read articles on adding the next line from the buffer, but I can't work out what I need to use in the pattern match to get this to work.

The output I want is:

extra1
extra2
Yellow 5

(In case it makes a difference, I am using a Mac, so I need this to work with - I think - the FreeBSD variant of sed.)

Of course, if another tool is better for what I want to achieve I am open to suggestions! Thanks!

It looks like you're trying to parse (x)html with regular expressions. [Please don't do it.](http://stackoverflow.com/a/1732454/237955) — amphetamachine, Feb 08 '16 at 20:41
Actually I am not. This is (partial) output from a process that decodes a binary plist and gives ASCII XML — Lorccan, Feb 08 '16 at 20:56
Possible duplicate of [Match a string that contains a newline using sed](https://stackoverflow.com/q/23850789/608639) — jww, Aug 20 '18 at 03:21

score 4 · Accepted Answer · answered Feb 08 '16 at 22:25


Join the lines and tear them apart:

tr -d "\n" < file| grep -o "<string>[^<]*</string>"|sed 's/<string>\(.*\)<\/string>/\1/'
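To make the three stages of this pipeline easier to follow, here is a minimal sketch that runs them separately. It assumes the sample input from the question, rebuilt with printf; the variable names input, joined and result are only illustrative.

```shell
# Sample input from the question (note the trailing space after "Yellow").
input=$(printf '%s\n' \
    '<array>' \
    '    <string>extra1</string>' \
    '    <string>extra2</string>' \
    '    <string>Yellow ' \
    '5</string>')

# Stage 1: tr deletes every newline, so the whole document becomes one line.
joined=$(printf '%s\n' "$input" | tr -d '\n')

# Stage 2: grep -o prints each <string>...</string> match on its own line.
# Stage 3: sed strips the surrounding tags from each match.
result=$(printf '%s\n' "$joined" \
    | grep -o '<string>[^<]*</string>' \
    | sed 's/<string>\(.*\)<\/string>/\1/')

printf '%s\n' "$result"
```

Because the newline is deleted rather than replaced, "Yellow 5" comes out correctly only because the input already has a space before the line break.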


Walter A


Thank you. This does what I want - and, importantly, I understand it! – Lorccan Feb 08 '16 at 22:53
You can get the same effect by setting `IFS` to **not** include the carriage return. Something like `IFStemp=$IFS; IFS=$' \t'; sed...; IFS=$IFStemp` – SaxDaddy Feb 08 '16 at 23:09
@SaxDaddy: Interesting! I tried `IFS=$' \t'; printf "%s\n%s\n" "first line" "second line" | sed 's/line/part/'` but I just got two lines. What do I do wrong? – Walter A Feb 09 '16 at 20:02
Your first `\n` added a CR between the two lines. You probably want `printf "%s %s\n"` – SaxDaddy Feb 09 '16 at 21:34
@SaxDaddy I used the CR since the main issue is how to delete the CR between `<string>` and `</string>`. – Walter A Feb 09 '16 at 21:42

Cyrus · Answer 2 · 2020-11-19T21:15:17.120

3

Close your array tag and try this with xmlstarlet and GNU sed:

xmlstarlet sel -t -v "//array/string" input.xml | sed '/ $/{:a;N;s/\n//;ta}'

Or only with xmlstarlet:

xmlstarlet sel -t --match '//array/string' --value-of 'normalize-space()' -n input.xml

Output:

extra1
extra2
Yellow 5
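The same `N` technique can also be applied directly to the question's input, without xmlstarlet. The following is a sketch only: it avoids GNU-only syntax (one command per line, no space after `!`), so it should be acceptable to the BSD sed shipped with macOS, but that is an assumption and it has not been verified there. The variable names are illustrative.

```shell
# Sample input from the question (note the trailing space after "Yellow").
input=$(printf '%s\n' \
    '<array>' \
    '    <string>extra1</string>' \
    '    <string>extra2</string>' \
    '    <string>Yellow ' \
    '5</string>')

# For each line containing <string>: while the closing tag is missing,
# append the next input line (N) and delete the embedded newline, then
# loop back to label a. Once the element is complete, print its content.
result=$(printf '%s\n' "$input" | sed -n '
/<string>/{
    :a
    /<\/string>/!{
        N
        s/\n//
        ba
    }
    s#.*<string>\(.*\)</string>#\1#p
}')

printf '%s\n' "$result"
```

With the sample input this prints extra1, extra2 and Yellow 5. As with the other approaches, the space in "Yellow 5" comes from the input itself, since the newline is simply deleted.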

edited Nov 19 '20 at 21:15

answered Feb 08 '16 at 20:59


I am not familiar with xmlstarlet, but I'll take a look. – Lorccan Feb 08 '16 at 22:54

Ed Morton · Answer 3 · 2016-02-08T22:16:07.747

2

Any time you start talking about "buffers" or "hold space" or sed constructs other than s, g, and p (with -n) you're simply using the wrong tool. All of that stuff for sed became obsolete in the mid-1970s when awk was invented so just use awk. Here's one way with GNU awk for multi-char RS:

$ awk -v RS='</?string>' '!(NR%2){gsub(/\n/," "); print}' file
extra1
extra2
Yellow 5

The above just prints whatever's between <string> and </string> after converting any newlines to blank chars.

With other awks one way would be:

$ cat tst.awk
{ rec = (rec=="" ? "" : rec " ") $0 }
END {
    split(rec,f,"</?string>")
    for (i=2;i in f;i+=2) {
        print f[i]
    }
}

$ awk -f tst.awk file
extra1
extra2
Yellow 5
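If the input really has both a space and a newline between "Yellow" and "5", as the question notes, joining lines with an extra blank can leave two spaces in the output. A possible variation on the script above, offered only as a sketch, keeps the newlines while accumulating and then collapses each run of whitespace inside an element to a single space. It is meant to use only POSIX awk features; the variable names are illustrative.

```shell
# Sample input from the question (note the trailing space after "Yellow").
input=$(printf '%s\n' \
    '<array>' \
    '    <string>extra1</string>' \
    '    <string>extra2</string>' \
    '    <string>Yellow ' \
    '5</string>')

result=$(printf '%s\n' "$input" | awk '
    { rec = rec $0 "\n" }                   # accumulate the whole input
    END {
        # Split on opening and closing tags; the even-numbered pieces
        # are the text between <string> and </string>.
        n = split(rec, f, /<\/?string>/)
        for (i = 2; i <= n; i += 2) {
            gsub(/[ \t\n]+/, " ", f[i])     # collapse whitespace runs
            print f[i]
        }
    }')

printf '%s\n' "$result"
```

This prints extra1, extra2 and Yellow 5 with exactly one space, whether or not the source line ends in a trailing blank.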

edited Feb 08 '16 at 22:16

answered Feb 08 '16 at 21:32



I take your point about the mid-1970s. I'll have a better look at awk. – Lorccan Feb 08 '16 at 22:55
It will be well worth your while. We see a lot of people asking for sed answers and, unfortunately for them, getting them, so it'd be good to know which sed constructs are still extremely useful and which are just a mental exercise or an attempt to turn lead into gold! The book Effective Awk Programming, 4th Edition, by Arnold Robbins is your best starting point. – Ed Morton Feb 08 '16 at 23:03

I will. The biggest issue I have (apart from ignorance) is that I have too many ideas flowing at me (mostly from here) and insufficient time to do justice to researching them all properly! – Lorccan Feb 08 '16 at 23:19

score 1 · Answer 4 · answered Feb 08 '16 at 21:07


Perl is available on OS X by default, so you can use:

perl -0ne 's#<string>([^<]*)</string>#sub{$x=$1;$x=~tr/\n/ /;print $x."\n";}->()#eg' file.xml
extra1
extra2
Yellow 5

Alternatively you can install GNU awk using Homebrew and use:

awk -v RS= -v FPAT='<string>([^<]*)</string>' '{for(i=1; i<=NF; i++) {
   gsub(/<\/?string>/, "", $i); gsub(/\n/, " ", $i); print $i}}' file.xml
extra1
extra2
Yellow 5


anubhava



I'll look at perl as well as awk. Thanks for your suggestions. – Lorccan Feb 08 '16 at 22:56

score 0 · Answer 5 · answered Feb 12 '16 at 00:04

You can approach this problem with xmllint. I modified your example slightly so that you can see what's going on.

test.xml

<array>
  <string1>extra1</string1>
  <string2>extra2</string2>
  <string3>Yellow
5</string3>
</array>

Since you want the string with the line break, I made this value unique. Now use xmllint and sed to get your results:

[saxdaddy ~]$  x="$(xmllint --xpath "/array/string3" test.xml | sed '/^\/ >/d' | sed 's/<[^>]*.//g')"
[saxdaddy ~]$  echo $x
Yellow 5

xmllint's xpath feature will search the XML in dictionary manner. sed will then strip out the beginning and ending tags. The "trick" to this is using quotes to capture the variable and then not using quotes to echo the result: the unquoted $x undergoes word splitting, so the line break is collapsed into a single space.

If your target tag is not unique in the file path, then you can craft a for loop to look for $'\n' (a line break) and set that to your variable.
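The quoting trick described above can be seen in isolation with a short sketch. The value of x is typed in by hand here to stand in for the xmllint output, which is an assumption made so the example runs without xmllint installed.

```shell
# Stand-in for the captured value: "Yellow", a space, a newline, then "5".
x=$(printf 'Yellow \n5')

# Unquoted: the shell splits $x on whitespace (including the newline)
# and echo joins the resulting words with single spaces.
unquoted=$(echo $x)

# Quoted: the value is passed through unchanged, newline included.
quoted=$(echo "$x")

printf 'unquoted: [%s]\n' "$unquoted"
printf 'quoted:   [%s]\n' "$quoted"
```

The unquoted form yields "Yellow 5" on one line. Note that an unquoted expansion is also subject to filename globbing, so this shortcut is only safe when the value cannot contain characters such as * or ?.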

score 0 · Answer 6 · answered Nov 28 '20 at 16:03


Please use a tool like xidel that's designed to parse xml:

xidel -s input.xml -e '//string/normalize-space(.)'
extra1
extra2
Yellow 5


Reino

