
I'm searching Sphinx .rst text files for roles of the form

:variablerolename:`text may span newline`

A single line can contain several different :variablerolename: roles, each starting anywhere on the line.

For example, given this input:

yada :role2:`texty text` yada :role:`text
line` yada filler
yada yada :role:`text of role` yada yada :role2:`start of text
rest of text`
more text :rolename:`Text after this role`
filler :otherrole:`This role 
text` filler

Searching for answers, I've gotten as far as:

grep -P '(?s):[a-z].*:`.*`' filename

But I don't think this is properly matching multiple :role: blocks on a line because one line of output I get is

yada yada :role:`text of role` yada yada :role2:`start of text

but the rest of the role2 text up to the closing back quote isn't printed on the next line.

The output I want is just the role name and the backquoted text, with each instance alone on its own line and without the surrounding text. So, something like:

:role2:`texty text`
:role:`text line`
:role:`text of role`
:role2:`start of text rest of text`
:rolename:`Text after this role`
:otherrole:`This role text`

I'll be piping the output of this to sort | uniq, so each match needs to be on a single line.

I'm limited to what's available on RHEL 6.7 (so the latest features might not be there):

  • GNU bash, version 4.1.2
  • GNU Awk 3.1.7
  • grep (GNU grep) 2.20
  • GNU sed version 4.2.1
Torfey
  • You *might* have an easier time of this if you use `awk` and set `RS` to `:`. That'll still leave you needing to pull out just the backtick quoted text and mapping it to the *previous* record but it should be doable. – Etan Reisner Apr 18 '16 at 16:41
  • It's hard to tell what you are trying to do. Please edit your question to post concise, testable sample input and the expected output given that input, i.e. a [mcve]. Right now it seems to be several disconnected and unclear examples. – Ed Morton Apr 18 '16 at 19:32
  • You're right. Sorry. Tried to clean it up. – Torfey Apr 18 '16 at 20:03

1 Answer


It's not clear from your question but this may be what you need (uses GNU awk for multi-char RS and RT):

awk -v RS=':[^:]+:`[^`]+`' 'RT{print RT}' file

e.g.:

$ cat file
yada yada :role:`text of role` yada yada :role2:`start of text
end of text` yada yada

$ awk -v RS=':[^:]+:`[^`]+`' 'RT{print RT}' file
:role:`text of role`
:role2:`start of text
end of text`

Replacing any newlines with blank chars would just be:

$ awk -v RS=':[^:]+:`[^`]+`' 'RT{gsub(/\n/," ",RT); print RT}' file
:role:`text of role`
:role2:`start of text end of text`

To only output unique values:

$ awk -v RS=':[^:]+:`[^`]+`' 'RT{gsub(/\n/," ",RT); if (!seen[RT]++) print RT}' file
:role:`text of role`
:role2:`start of text end of text`
Ed Morton
  • This gets me extremely close. As your example shows, still have the newline between the backquoted text. If I could get rid of that newline it would be exactly what I need. – Torfey Apr 18 '16 at 20:09
  • Maybe some extra newline magic (though it makes it less nice): `awk -v RS=':[^:]+:`[^`]+' 'RT{print gensub(/ *[\r\n]+ */, " ", "g", RT)}' file`. (Crap. losing the backslash in the RS definition.) – joepd Apr 18 '16 at 20:28
  • I think that addition does what I need for joining each together on one line. And nicely puts a space where the newline was between the backquotes so `sort |uniq` will work as intended on output. [I was trying and failing to use another tool on the output of awk to match lines that didn't end in backquote, then join that line with next line.] – Torfey Apr 18 '16 at 20:48
  • Though, as you say because of formatting limitations in comments, it zaps the backquotes from the original answer and can't use the 4 space code style of an 'answer'. – Torfey Apr 18 '16 at 20:55
  • I added a script that replaces the newlines with blank chars. You don't need `sort | uniq` if all you want is unique values output btw. I've added that script too. – Ed Morton Apr 18 '16 at 21:28