How to grep multi line string with new line characters or tab characters or spaces

Question

My test file has text like:

> cat test.txt
new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");

I am trying to match every statement that ends with a semicolon (;) and contains the text "dummy(", even when the statement spans several lines. Then I need to extract the string inside the double quotes that follows dummy(. I have come up with the following command, but it matches only the first and third statements.

> perl -ne 'print if /dummy/ .. /;/' test.txt | grep -oP 'dummy\((.|\n)*,'
dummy("test1",
dummy("test3",

With the -o flag I expected to extract the string between the double quotes inside dummy, but that is not working either. Can you please give me an idea of how to proceed?

Expected output is:

test1
test2
test3
test4

Some of the answers below work for basic file structures, but they break if a statement contains more than one newline character. For example, here is an input file with more newline characters:

new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");
new dummy("test5",
        random5).foo("bar5");
new dummy("test6", random6).foo(
        "bar6");
new dummy("test7", random7).foo("
        bar7");

I referred to the following SO links:

How to give a pattern for new line in grep?

how to grep multiple lines until ; (semicolon)

For irregular quoted fields, you might like to have a look at `quotewords` from the core module [`Text::ParseWords`](https://perldoc.perl.org/Text::ParseWords). — TLP, Apr 14 '22 at 18:11

glenn jackman · Accepted Answer · 2022-04-14T17:38:58.407

3

@TLP was pretty close:

perl -0777 -nE 'say for map {s/^\s+|\s+$//gr} /\bdummy\(\s*"(.+?)"/gs' test.txt

test1
test2
test3
test4

Using:

- -0777 to slurp the file in as a single string
- /\bdummy\(\s*"(.+?)"/gs finds all the quoted string content after "dummy(" (with optional whitespace before the opening quote)
  - the s flag allows . to match newlines.
  - any string containing escaped double quotes will break this regex
- map {s/^\s+|\s+$//gr} trims leading/trailing whitespace from each string.
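
For readers who do not use Perl, the same three steps (slurp, non-greedy match across newlines, trim) can be sketched in Python. This is only an illustrative translation and not part of the original answer: the function name extract_first_args and the inline SAMPLE text are assumptions made for the example, and like the Perl regex it will break on escaped double quotes.

```python
import re

# Sample input copied from the question; in practice you would read the
# whole file at once, e.g. open("test.txt").read() (the "slurp" step).
SAMPLE = '''new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");
'''

# Same pattern as the Perl one-liner; re.DOTALL plays the role of the /s flag
# so that "." can also match newlines.
PATTERN = re.compile(r'\bdummy\(\s*"(.+?)"', re.DOTALL)


def extract_first_args(text):
    """Return the trimmed first quoted argument of every dummy( call."""
    return [captured.strip() for captured in PATTERN.findall(text)]


if __name__ == "__main__":
    for value in extract_first_args(SAMPLE):
        print(value)
```

The non-greedy (.+?) matters here: a greedy .+ combined with DOTALL would run to the last double quote in the file.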

edited Apr 14 '22 at 17:38

answered Apr 14 '22 at 16:47

glenn jackman


Thank you so much! It almost works. :) When value to be extracted including the double quotes is on new line, it doesn't work. I have updated my question with the scenario of "test4". How can we match this additional scenario? – user613114 Apr 14 '22 at 16:57

anubhava · Answer 2 · 2022-04-14T17:22:13.867

3

This perl should work:

perl -0777 -pe 's/(?m)^[^(]* dummy\(\s*"\s*([^"]+).*/$1/g' file

test1
test2
test3
test4

The following GNU grep + tr pipeline should also work:

grep -zoP '[^(]* dummy\(\s*"\s*\K[^"]+"' file | tr '"' '\n'

test1
test2
test3
test4

edited Apr 14 '22 at 17:22

answered Apr 14 '22 at 16:59

anubhava


score 2 · Answer 3 · answered Apr 15 '22 at 02:35

With your shown samples, please try the following awk code. It was written and tested in GNU awk (it relies on a regex RS and on RT).

awk -v RS='(^|\n)new[^;]*;' '
RT{
  rt=RT
  gsub(/\n+|[[:space:]]+/,"",rt)
  match(rt,/"[^"]*"/)
  print substr(rt,RSTART+1,RLENGTH-2)
}
'  Input_file

answered Apr 15 '22 at 02:35

RavinderSingh13


dawg · Answer 4 · 2022-04-14T19:42:55.640

Given:

$ cat file
new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");

You can use GNU grep this way:

$ grep -ozP '[^;]*\bdummy[^";]*"\s*\K[^";]*[^;]*;' file | tr '\000' '\n' | grep -oP '^[^"]*'
test1
test2
test3
test4

Somewhat more robust, if this is ;-delimited text, you can:

- split on the ;
- filter for /\bdummy\b/
- grab the first field in quotes
- strip the whitespace

Here is all that in Ruby:

ruby -e 'puts $<.read.split(/(?<=;)/).
                select{|b| b[/\bdummy\b/]}.
                map{|s| s[/(?<=")[^"]*/].strip}' file 
# same output
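
The same four steps (split on ;, filter, grab the first quoted field, strip) can also be sketched in Python. This is a hedged illustration rather than part of the original answer: the function name first_quoted_per_statement is hypothetical, and the sample is the extended input from the question with statements that wrap in several places. Unlike the Ruby one-liner, this sketch skips a statement that has no quoted field instead of raising an error.

```python
import re

# Extended sample from the question: statements wrap at different points.
EXTENDED_SAMPLE = '''new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");
new dummy("test5",
        random5).foo("bar5");
new dummy("test6", random6).foo(
        "bar6");
new dummy("test7", random7).foo("
        bar7");
'''


def first_quoted_per_statement(text):
    """Split on ';', keep statements mentioning dummy, return each first quoted field."""
    results = []
    for statement in text.split(";"):                    # 1. split on the ;
        if not re.search(r"\bdummy\b", statement):       # 2. filter for dummy
            continue
        quoted = re.search(r'"([^"]*)"', statement)      # 3. first field in quotes
        if quoted:
            results.append(quoted.group(1).strip())      # 4. strip the whitespace
    return results


if __name__ == "__main__":
    print("\n".join(first_quoted_per_statement(EXTENDED_SAMPLE)))
```

Because the text is split on ; before any matching happens, where the newlines fall inside a statement does not matter. A semicolon inside a quoted string would still break this approach.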

Thank you @dawg. This breaks if "randomx" strings are on the new line. e.g. you can try adding a new character before random1. — user613114, Apr 14 '22 at 17:06

score 1 · Answer 5 · answered Apr 14 '22 at 18:15

You can use Text::ParseWords to extract the quoted fields.

use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

my $str = do {
    local $/;
    <DATA>;
};   # slurp the text into a variable
my @lines = quotewords(q("), 1, $str);   # extract fields
my @txt;

for (0 .. $#lines) {
    if ($lines[$_] =~ /\bdummy\s*\(/) {
        push @txt, $lines[$_+1];         # target text will be in fields following "dummy("
    }
}

s/^\s+|\s+$//g for @txt;     # trim leading/trailing whitespace
print Dumper \@txt;

__DATA__
new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");

Output:

$VAR1 = [
          'test1',
          'test2',
          'test3',
          'test4'
        ];

RARE Kpop Manifesto · Answer 6 · 2022-04-15T10:58:55.903

awk-based solution handling everything via FS :

<test1.txt gawk -b -e 'BEGIN { RS="^$"

 FS="((^|\\n)?"(___="[^\\n")"]+y[(]"(_="[ \\t\\n]*")(__="[\\42]")(_)\
    "|"(_="[ \\t]*")(__)(_)"[,]"(___)";]+[;][\\n])+"} sub(OFS=ORS,"",$!--NF)'          

test1
test2
test3
test4

gawk was benchmarked at 2 million rows at 5.15 secs, so unless your input file is beyond 100 MB, this suffices.

*** caveat : avoid using mawk-1.9.9.6 with this solution

score 0 · Answer 7 · answered Apr 15 '22 at 15:56

Suggesting a simple gawk script (gensub is a gawk extension, so this needs GNU awk, which is the default awk on many Linux systems):

 awk '/dummy/{print gensub("[[:space:]]*","",1,$2)}' RS=';' FS='"'  input.txt

Explanation:

RS=';' Set the awk record separator to ;

FS='"' Set the awk field separator to "

/dummy/ Keep only records matching the regexp dummy

gensub("[[:space:]]*","",1,$2) Trim any whitespace from the beginning of the 2nd field

print gensub("[[:space:]]*","",1,$2) Print the trimmed 2nd field
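
To make the record/field idea concrete, here is a rough Python sketch of what the awk command does. It is an assumption-laden illustration, not the answerer's code: the function name second_fields and the inline SAMPLE are made up for the example, and str.lstrip stands in for the gensub call.

```python
# Sample input copied from the question.
SAMPLE = '''new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");
'''


def second_fields(text):
    """Mimic RS=';' and FS='"': return the left-trimmed 2nd field of each dummy record."""
    output = []
    for record in text.split(";"):            # RS=';' -> one record per statement
        if "dummy" not in record:             # /dummy/ -> keep matching records only
            continue
        fields = record.split('"')            # FS='"' -> fields are split on double quotes
        # awk's $2 is empty when a record has fewer than two fields
        second = fields[1] if len(fields) > 1 else ""
        output.append(second.lstrip())        # like gensub: only leading whitespace is removed
    return output


if __name__ == "__main__":
    print("\n".join(second_fields(SAMPLE)))
```

Note that, as in the awk version, only leading whitespace is trimmed, so a value with trailing whitespace before its closing quote would keep it.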
