My test file has text like:

> cat test.txt
new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");

I am trying to match every statement (a single logical line ending with a semicolon) that contains the text "dummy(". Then I need to extract the string in the double quotes inside dummy. I have come up with the following command, but it matches only the first and third statements.

> perl -ne 'print if /dummy/ .. /;/' test.txt | grep -oP 'dummy\((.|\n)*,'
dummy("test1",
dummy("test3",

With the -o flag I expected to extract the string between the double quotes inside dummy, but that is not working either. Can you please give me an idea of how to proceed?

Expected output is:

test1
test2
test3
test4

Some of the answers below work for basic file structures, but the code breaks if a statement contains more than one newline character. For example, an input file with more newline characters:

new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");
new dummy("test5",
        random5).foo("bar5");
new dummy("test6", random6).foo(
        "bar6");
new dummy("test7", random7).foo("
        bar7");

I referred to following SO links:

How to give a pattern for new line in grep?

how to grep multiple lines until ; (semicolon)

user613114
  • `perl -0777 -nle 'print for /dummy\("([^"]*)"/g' test.txt` – TLP Apr 14 '22 at 16:36
  • For irregular quoted fields, you might like to have a look at `quotewords` from the core module [`Text::ParseWords`](https://perldoc.perl.org/Text::ParseWords). – TLP Apr 14 '22 at 18:11

7 Answers

@TLP was pretty close:

perl -0777 -nE 'say for map {s/^\s+|\s+$//gr} /\bdummy\(\s*"(.+?)"/gs' test.txt
test1
test2
test3
test4

Using

  • -0777 to slurp the file in as a single string
  • /\bdummy\(\s*"(.+?)"/gs finds all the quoted string content after "dummy(" (with optional whitespace before the opening quote)
    • the s flag allows . to match newlines.
    • any string containing escaped double quotes will break this regex
  • map {s/^\s+|\s+$//gr} trims leading/trailing whitespace from each string.
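
The escaped-quote caveat in the list above can be worked around by letting the capture group accept backslash-escaped characters. The following is only a sketch of that idea in Ruby (the language another answer here uses), not the command from this answer. It assumes that backslash is the escape character, and the names `DUMMY_ARG` and `extract_dummy_args` are made up for illustration.

```ruby
# Hypothetical helper: first quoted argument of every dummy( call.
# Assumption: quotes inside the string are escaped with a backslash (\").
# The /m flag lets . match newlines, like the s flag in the Perl regex.
DUMMY_ARG = /\bdummy\(\s*"((?:\\.|[^"\\])*)"/m

def extract_dummy_args(text)
  # scan returns one array per match holding the capture group;
  # strip trims leading/trailing whitespace, including newlines.
  text.scan(DUMMY_ARG).map { |(arg)| arg.strip }
end

# Sample input, partly taken from the question, with one escaped-quote case added.
sample = <<~'TEXT'
  new dummy("test1", random1).foo("bar1");
  new dummy("
          test2", random2);
  new dummy = dummy(
              "say \"hi\"", random4).foo("bar4");
TEXT

puts extract_dummy_args(sample)
```

This prints test1, test2 and say \"hi\" on separate lines. The escaped quotes are kept verbatim, so unescaping them would be a separate step.
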
glenn jackman
  • Thank you so much! It almost works. :) When value to be extracted including the double quotes is on new line, it doesn't work. I have updated my question with the scenario of "test4". How can we match this additional scenario? – user613114 Apr 14 '22 at 16:57

This perl should work:

perl -0777 -pe 's/(?m)^[^(]* dummy\(\s*"\s*([^"]+).*/$1/g' file

test1
test2
test3
test4

The following GNU grep + tr pipeline should also work:

grep -zoP '[^(]* dummy\(\s*"\s*\K[^"]+"' file | tr '"' '\n'

test1
test2
test3
test4
anubhava

With your shown samples, please try the following awk code, written and tested in GNU awk.

awk -v RS='(^|\n)new[^;]*;' '
RT{
  rt=RT
  gsub(/\n+|[[:space:]]+/,"",rt)
  match(rt,/"[^"]*"/)
  print substr(rt,RSTART+1,RLENGTH-2)
}
'  Input_file
RavinderSingh13

Given:

$ cat file
new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");

You can use GNU grep this way:

$ grep -ozP '[^;]*\bdummy[^";]*"\s*\K[^";]*[^;]*;' file | tr '\000' '\n' | grep -oP '^[^"]*'
test1
test2
test3
test4

Somewhat more robust, if this is a ; delimited text, you can:

  1. split on the ;;
  2. filter for /\bdummy\b/;
  3. grab the first field in quotes;
  4. strip the whitespace.

Here is all that in Ruby:

ruby -e 'puts $<.read.split(/(?<=;)/).
                select{|b| b[/\bdummy\b/]}.
                map{|s| s[/(?<=")[^"]*/].strip}' file 
# same output
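
As a sketch only, the four steps above can also be wrapped in a method and run against the extended sample from the question, where the line breaks fall in different places. The method name `first_quoted_in_dummy_statements` is hypothetical, and the `compact` step is an addition so that a statement mentioning dummy without any quoted field is skipped instead of raising an error.

```ruby
# Hypothetical helper built from the four steps listed above.
def first_quoted_in_dummy_statements(text)
  text.split(/(?<=;)/)                        # 1. split after each ;
      .select { |stmt| stmt =~ /\bdummy\b/ }  # 2. keep statements mentioning dummy
      .map { |stmt| stmt[/(?<=")[^"]*/] }     # 3. first quoted field (nil if none)
      .compact                                # added: drop statements without quotes
      .map(&:strip)                           # 4. strip surrounding whitespace
end

# Extended sample copied from the question.
extended = <<~'TEXT'
  new dummy("test1", random1).foo("bar1");
  new dummy("
          test2", random2);
  new dummy("test3", random3).foo("bar3");
  new dummy = dummy(
              "test4", random4).foo("bar4");
  new dummy("test5",
          random5).foo("bar5");
  new dummy("test6", random6).foo(
          "bar6");
  new dummy("test7", random7).foo("
          bar7");
TEXT

puts first_quoted_in_dummy_statements(extended)
```

This prints test1 through test7, one per line. Because the text is split on ; rather than on newlines, it does not matter where the line breaks fall inside a statement.
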
dawg
  • Thank you @dawg. This breaks if the "randomx" strings are on a new line, e.g. you can try adding a newline character before random1. – user613114 Apr 14 '22 at 17:06

You can use Text::ParseWords to extract the quoted fields.

use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;

my $str = do {
    local $/;
    <DATA>;
};   # slurp the text into a variable
my @lines = quotewords(q("), 1, $str);   # extract fields
my @txt;

for (0 .. $#lines) {
    if ($lines[$_] =~ /\bdummy\s*\(/) {
        push @txt, $lines[$_+1];         # target text will be in fields following "dummy("
    }
}

s/^\s+|\s+$//g for @txt;     # trim leading/trailing whitespace
print Dumper \@txt;

__DATA__
new dummy("test1", random1).foo("bar1");
new dummy("
        test2", random2);
new dummy("test3", random3).foo("bar3");
new dummy = dummy(
            "test4", random4).foo("bar4");

Output:

$VAR1 = [
          'test1',
          'test2',
          'test3',
          'test4'
        ];
TLP

An awk-based solution that handles everything via FS:

<test1.txt gawk -b -e 'BEGIN { RS="^$"

 FS="((^|\\n)?"(___="[^\\n")"]+y[(]"(_="[ \\t\\n]*")(__="[\\42]")(_)\
    "|"(_="[ \\t]*")(__)(_)"[,]"(___)";]+[;][\\n])+"} sub(OFS=ORS,"",$!--NF)'          

test1
test2
test3
test4

gawk was benchmarked at 2 million rows at 5.15 secs, so unless your input file is beyond 100 MB, this suffices.

*** caveat : avoid using mawk-1.9.9.6 with this solution

RARE Kpop Manifesto

Suggesting a simple gawk script (standard Linux awk):

 awk '/dummy/{print gensub("[[:space:]]*","",1,$2)}' RS=';' FS='"'  input.txt

Explanation:

RS=';' Set the awk record separator to ;

FS='"' Set the awk field separator to "

/dummy/ Keep only records matching the RegExp dummy

gensub("[[:space:]]*","",1,$2) Trim any whitespace from the beginning of the 2nd field

print gensub("[[:space:]]*","",1,$2) Print the trimmed 2nd field

Dudi Boy