match pattern possibly over newline and only print pattern

Question

I'm searching Sphinx .rst text files for roles of the form

:variablerolename:`text may span newline`

A single line can contain several different :variablerolename: pieces, each starting anywhere on that line.

So, as input example, I have:

yada :role2:`texty text` yada :role:`text
line` yada filler
yada yada :role:`text of role` yada yada :role2:`start of text
rest of text`
more text :rolename:`Text after this role`
filler :otherrole:`This role 
text` filler

Searching for answers I've gotten as far as

grep -P '(?s):[a-z].*:`.*`' filename

But I don't think this is properly matching multiple :role: blocks on a line because one line of output I get is

yada yada :role:`text of role` yada yada :role2:`start of text

but the rest of the role2 text up to the closing back quote isn't printed on the next line.

The output I want would be just the role name and the back quoted text, each instance alone on a line, without the pre and post text. So, something like:

:role2:`texty text`
:role:`text line`
:role:`text of role`
:role2:`start of text rest of text`
:rolename:`Text after this role`
:otherrole:`This role text`

I'll be piping the output to sort | uniq, so each match needs to be on a single line.

I'm limited to what's available on RHEL 6.7, so the latest features might not be there:

GNU bash, version 4.1.2
GNU Awk 3.1.7
grep (GNU grep) 2.20
GNU sed version 4.2.1

You *might* have an easier time of this if you use `awk` and set `RS` to `:`. That'll still leave you needing to pull out just the backtick quoted text and mapping it to the *previous* record but it should be doable. — Etan Reisner, Apr 18 '16 at 16:41
It's hard to tell what you are trying to do. Please edit your question to post concise, testable sample input and the expected output given that input, i.e. a [mcve]. Right now it seems to be several disconnected and unclear examples. — Ed Morton, Apr 18 '16 at 19:32

Ed Morton · Accepted Answer · 2016-04-18T21:27:06.410

It's not clear from your question but this may be what you need (uses GNU awk for multi-char RS and RT):

awk -v RS=':[^:]+:`[^`]+`' 'RT{print RT}' file

e.g.:

$ cat file
yada yada :role:`text of role` yada yada :role2:`start of text
end of text` yada yada

$ awk -v RS=':[^:]+:`[^`]+`' 'RT{print RT}' file
:role:`text of role`
:role2:`start of text
end of text`

Replacing any newlines with blank chars would just be:

$ awk -v RS=':[^:]+:`[^`]+`' 'RT{gsub(/\n/," ",RT); print RT}' file
:role:`text of role`
:role2:`start of text end of text`
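
If a line ends with trailing spaces before the break (as the `:otherrole:` sample in the question does), replacing only the newline leaves a double space, and otherwise identical entries would then compare as different. Along the lines of joepd's comment, a minimal sketch of a variant that collapses the newline together with any adjacent blanks might look like this. The wrapper name `extract_roles` is purely illustrative, and the sketch assumes `gawk` is GNU awk, since `RT` and a regex `RS` are gawk extensions:

```shell
# Sketch: print each :role:`text` match on one line, turning a newline
# plus any blanks around it into a single space.
# Assumption: gawk is GNU awk (RT and a regex RS are gawk extensions).
extract_roles() {
    gawk -v RS=':[^:]+:`[^`]+`' '
        RT {
            # newline and surrounding spaces/tabs/CRs -> one space
            gsub(/[ \t\r]*\n[ \t]*/, " ", RT)
            print RT
        }' "$@"
}

# Sample with a trailing space before the line break, as in the question.
printf '%s\n' 'filler :otherrole:`This role ' 'text` filler' | extract_roles
# prints: :otherrole:`This role text`
```

With no arguments the function reads standard input; file names can be passed instead, e.g. `extract_roles file`.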

To only output unique values:

$ awk -v RS=':[^:]+:`[^`]+`' 'RT{gsub(/\n/," ",RT); if (!seen[RT]++) print RT}' file
:role:`text of role`
:role2:`start of text end of text`
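
Where GNU awk is not available, a rough alternative is to flatten the input to one line first and let `grep -o` print each match. This is only a sketch under a few assumptions that do not come from the question: role names consist of letters, digits, `_` or `-`, the file is small enough to handle as one long line, and `list_roles` is a made-up helper name:

```shell
# Sketch: flatten newlines (and squeeze runs of spaces) with tr,
# then have grep -o print only the matching :role:`text` pieces.
# Assumption: role names are made of letters, digits, "_" or "-".
list_roles() {
    tr -s '\n' ' ' | grep -oE ':[[:alnum:]_-]+:`[^`]+`'
}

printf '%s\n' \
    'yada yada :role:`text of role` yada yada :role2:`start of text' \
    'end of text` yada yada' | list_roles
# prints:
# :role:`text of role`
# :role2:`start of text end of text`
```

Because `tr -s` squeezes every run of spaces, not just those at line breaks, spacing inside the backquoted text is normalized as well. The result can be piped to `sort -u` if sorted unique lines are wanted.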

edited Apr 18 '16 at 21:27

answered Apr 18 '16 at 19:39

Ed Morton


This gets me extremely close. As your example shows, I still have the newline inside the backquoted text. If I could get rid of that newline it would be exactly what I need. – Torfey Apr 18 '16 at 20:09
Maybe some extra newline magic (though it makes it less nice): `awk -v RS=':[^:]+:`[^`]+' 'RT{print gensub(/ *[\r\n]+ */, " ", "g", RT)}' file`. (Crap. losing the backslash in the RS definition.) – joepd Apr 18 '16 at 20:28
I think that addition does what I need for joining each together on one line. And nicely puts a space where the newline was between the backquotes so `sort |uniq` will work as intended on output. [I was trying and failing to use another tool on the output of awk to match lines that didn't end in backquote, then join that line with next line.] – Torfey Apr 18 '16 at 20:48
Though, as you say because of formatting limitations in comments, it zaps the backquotes from the original answer and can't use the 4 space code style of an 'answer'. – Torfey Apr 18 '16 at 20:55
I added a script that replaces the newlines with blank chars. You don't need `sort | uniq` if all you want is unique values output btw. I've added that script too. – Ed Morton Apr 18 '16 at 21:28
