Can I use sed if I need to extract a pattern enclosed by a specific pattern, if it exists in a line?

Suppose I have a file with the following lines:

There are many who dare not kill themselves for [/fear/] of what the neighbors will say.

Advice is what we ask for when we already know the /* answer */ but wish we didn’t.

In both cases I have to scan the line for the first occurrence of the opening pattern, i.e. '[/' or '/*' respectively, and capture the text that follows up to the closing pattern, i.e. '/]' or '*/' respectively.

In short, I need fear and answer. If possible, can this be extended to multiple lines, in the sense that the closing pattern may occur on a different line than the opening one?

Any kind of help in the form of suggestions or algorithms is welcome. Thanks in advance for the replies.

Gil
  • I am not exactly sure if it can be done with sed, and by the way I wouldn't mind a Perl script. – Gil Jun 19 '12 at 14:21
  • As for `sed`, see my [question](http://stackoverflow.com/questions/11024245/sed-recipe-how-to-do-stuff-between-two-patterns-that-can-be-either-on-one-line): there are no easy ways proposed so far, but something can be done. – Lev Levitsky Jun 19 '12 at 14:39
  • @LevLevitsky Pretty interesting! Will definitely have to look through it again, once isn't enough. Thanks for adding the link :) – Gil Jun 20 '12 at 07:36

3 Answers

use strict;
use warnings;

while (<DATA>) {
    while (m#/(\*?)(.*?)\1/#g) {
        print "$2\n";
    }
}


__DATA__
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.
Advice is what we ask for when we already know the /* answer */ but wish we didn’t.

As a one-liner:

perl -nlwe 'while (m#/(\*?)(.*?)\1/#g) { print $2 }' input.txt

The inner while loop iterates over all matches on the line, which the /g modifier makes possible. The backreference \1 ensures that only identical open/close tags match: a star after the opening slash requires a star before the closing slash.

If you need to match blocks that extend over multiple lines, you need to slurp the input:

use strict;
use warnings;

$/ = undef;
while (<DATA>) {
    while (m#/(\*?)(.*?)\1/#sg) {
        print "$2\n";
    }
}

__DATA__
    There are many who dare not kill themselves for [/fear/] of what the neighbors will say. /* foofer */ 
    Advice is what we ask for when we already know the /* answer */ but wish we didn’t.
foo bar /
baz 
baaz / fooz

One-liner:

perl -0777 -nwe 'while (m#/(\*?)(.*?)\1/#sg) { print "$2\n" }' input.txt

The -0777 switch and $/ = undef will cause file slurping, meaning all of the file is read into a scalar. I also added the /s modifier to allow the wildcard . to match newlines.

Explanation for the regex: m#/(\*?)(.*?)\1/#sg

m#              # a simple m//, but with # as delimiter instead of slash
    /(\*?)      # slash followed by optional *
        (.*?)   # shortest possible string of wildcard characters
    \1/         # backref to optional *, followed by slash
#sg             # s modifier to make . match \n, and g modifier to find all matches

The "magic" here is that the backreference requires a star * only when one is found before it.

TLP
  • Will it match over multiple lines? – Zaid Jun 19 '12 at 14:40
  • Good job, though your regex is a little sore on my eyes :) – Zaid Jun 19 '12 at 14:52
  • @Zaid It's as sour as it needs to be :P – TLP Jun 19 '12 at 14:56
  • @TLP Though a little tough for me to digest, it does the job without a glitch in my case :) and the explanation is just great! Thanks a lot, Top Level Programmer ;) – Gil Jun 20 '12 at 07:42
  • @Geekasaur That's not what my nick means. :) If this answers your question, you should click the check mark to mark it as accepted. – TLP Jun 20 '12 at 10:01
  • @TLP Done! Just a small query: is it possible to invert the result, i.e. display the non-matching part? – Gil Jun 20 '12 at 10:17
  • @Geekasaur Yes. Change the regex to a substitution that removes the matches, and print the line instead of `$2`. E.g. `s#/(\*?)(.*?)\1/##sg; print;` for the latter one-liner. – TLP Jun 20 '12 at 10:46
  • @TLP Well ,that does it ! Thanks again :) – Gil Jun 20 '12 at 11:18
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/12805/discussion-between-geekasaur-and-tlp) – Gil Jun 20 '12 at 13:14
  • @TLP Sorry to bother you again, TLP. Is it possible to delete the exact space occupied by the matching pattern when displaying the non-matching part? – Gil Jun 21 '12 at 07:22
  • @Geekasaur That sounds exactly like what you just asked me: how to invert the match. – TLP Jun 21 '12 at 14:38
  • @TLP Yep, I agree! I want to remove the 'void' or 'blank space' occupied by the match in the inverted case. In the example, when the inverted match is executed we get "There are many who dare not kill themselves for |blank space| of what the neighbors will say". I want to remove the space left by the match. – Gil Jun 21 '12 at 15:04
  • @Geekasaur That's because you have one extra space left over. You can solve that by adding ` *` (a space followed by a star) before and after the match and putting a single space in the substitution: `s# */(\*?)(.*?)\1/ *# #sg;` – TLP Jun 21 '12 at 15:38

A quick and dirty way in awk (gensub requires GNU awk):

awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' input_file

Test:

$ cat file
There are many who dare not kill themselves for [/fear/] of what the neighbors will say.

Advice is what we ask for when we already know the /* answer */ but wish we didn't.
$ awk 'NF{ for (i=1;i<=NF;i++) if($i ~ /^\[\//) { print gensub (/^..(.*)..$/,"\\1","g",$i); } else if ($i ~ /^\/\*/) print $(i+1);next}1' file
fear

answer
jaypal singh

Single-Line Matches

If you really want to do this in sed, you can extract your delimited patterns relatively easily as long as they are on the same line.

# Using GNU sed. Escape a whole lot more if your sed doesn't handle
# the -r flag.
sed -rn 's![^*/]*(/\*?.*/).*!\1!p' /tmp/foo
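The command above prints the match with its delimiters still attached (for example `/fear/`). If only the inner text is wanted, one possible variation is to give each delimiter style its own substitution. This is a sketch under assumptions: the helper name `extract_sed` is hypothetical, it expects at most one delimited pattern per line, and it uses `-E`, which GNU sed accepts as an equivalent of `-r`.

```shell
# extract_sed: hypothetical helper, reads text on stdin.
# Assumes at most one delimited pattern per line.
# First expression handles [/.../], second handles /* ... */ and trims padding.
extract_sed() {
    sed -En \
        -e 's!.*\[/(.*)/\].*!\1!p' \
        -e 's!.*/\* *(.*[^ ]) *\*/.*!\1!p'
}

printf '%s\n' 'for [/fear/] of what' | extract_sed
printf '%s\n' 'know the /* answer */ but' | extract_sed
```

The two calls print `fear` and `answer` respectively; lines without either delimiter pair produce no output.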

Multi-Line Matches

If you want to perform multi-line matches with sed, things get a little uglier. However, it can certainly be done.

# Multi-line matching of delimiters with GNU sed.
sed -rn ':loop
         /\/[^\/]/ { 
             N
             s![^*/]+(/\*?.*\*?/).*!\1!p
             T loop
         }' /tmp/foo

The trick is to look for a starting delimiter, then keep appending lines in a loop until you find the ending delimiter.

This works really well as long as you really do have an ending delimiter. Otherwise, the contents of the file will keep being appended to the pattern space until sed finds one, or until it reaches the end of the file. This may cause problems with certain versions of sed or with really, really large files where the size of the pattern space gets out of hand.

See GNU sed's Limitations and Non-limitations for more information.
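As an illustration of the append-until-closed idea, here is a reduced sketch that handles only the `/* ... */` style and stops appending at the last line, so an unterminated block simply produces no output. It is not the script above: the helper name `extract_block_sed` is hypothetical, GNU sed is assumed, and only one block per accumulated chunk is extracted.

```shell
# extract_block_sed: hypothetical helper, reads text on stdin.
# On a line containing "/*", keep appending lines (N) until the pattern
# space also contains "*/" or the last line is reached, then print the body.
extract_block_sed() {
    sed -n '/\/\*/{
        :a
        /\*\//!{
            $!{
                N
                ba
            }
        }
        s!.*/\*\(.*\)\*/.*!\1!p
    }'
}

printf 'start /* first\nsecond */ end\n' | extract_block_sed
```

The printed body keeps the embedded newline and the padding next to the delimiters.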

Todd A. Jacobs