Regex (grep) for multi-line search needed

Question

I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.

I've tried a few variations on the following:

$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"

This, however, just runs forever. Can anyone help me with the correct syntax please?

The grep you've indicated here runs forever because you have not specified any files to search at the end of the command... The '--include' is a filter of the files named and doesn't actually provide you any files to be filtered. — marklark, Mar 04 '13 at 17:01

score 624 · Accepted Answer · edited Jan 20 '22 at 21:19

Without the need to install the grep variant pcregrep, you can do a multiline search with grep.

$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c

Explanation:

-P activate perl-regexp for grep (a powerful extension of regular expressions)

-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware this also adds a trailing NUL char if used with -o, see comments.

-o print only matching. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; this way it won't do that.
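
The effect of -z on what grep treats as a "line" can be seen by counting records. A small sketch, assuming GNU grep; the two function names are made up for illustration:

```shell
#!/bin/sh
# Count the records grep sees on standard input. An empty pattern
# matches every record, so -c reports the number of records.

# Default mode: records are terminated by newlines.
count_newline_records() {
    grep -c ''
}

# With -z: records are terminated by NUL bytes, so text that contains
# no NUL at all is treated as one single record.
count_nul_records() {
    grep -zc ''
}

printf 'select\ncustomerName\nfrom\n' | count_newline_records
printf 'select\ncustomerName\nfrom\n' | count_nul_records
```

The first call should report 3 records and the second 1, which is why a pattern can span newlines once -z is given.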

In regexp:

(?s) activate PCRE_DOTALL, which means that . matches any character, including a newline

\N find anything except newline, even with PCRE_DOTALL activated

.*? find . in non-greedy mode, that is, stops as soon as possible.

^ find start of line

\1 backreference to the first group (\s*). This is an attempt to match the closing brace at the same indentation as the first line of the method.

As you can imagine, this search prints the main method in a C (*.c) source file.
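
Applied to the original question, the same options can list the *.sql files in which select, customerName and from appear in that order, even across lines. This is a sketch, not a recipe verified on every grep build: it assumes GNU grep with PCRE support, the function name list_sql_matches is invented here, and the use of [^;] is an assumption of mine to keep a match from running past the end of one statement. Note the directory operand at the end, which the command in the question lacks.

```shell
#!/bin/sh
# List *.sql files under a directory that contain
# select ... customerName ... from, possibly spread over several lines.
#   -r  recurse               -l  print file names only
#   -i  ignore case           -P  Perl-compatible regex
#   -z  NUL-terminated records, so a whole file is usually one record
# [^;]*? matches any characters except ';' (newlines included), which
# keeps the three words inside one SQL statement. That is an assumption
# about how the files are written, not something grep knows about SQL.
list_sql_matches() {
    grep -rliPz --include='*.sql' --exclude-dir='.svn' \
        '\bselect\b[^;]*?\bcustomerName\b[^;]*?\bfrom\b' "$1"
}

# Demo on a throwaway directory.
demo_dir=$(mktemp -d)
printf 'SELECT id,\n\tcustomerName\nFROM customers;\n' > "$demo_dir/hit.sql"
printf 'select id\nfrom orders;\n' > "$demo_dir/miss.sql"
list_sql_matches "$demo_dir"
rm -rf "$demo_dir"
```

Only hit.sql should be listed. Because -l is used, the trailing NUL that -o would add is not an issue here.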

edited Jan 20 '22 at 21:19

Jonathan Leffler

answered Aug 23 '11 at 20:26

albfan

/bin/grep: The -P and -z options cannot be combined – Oli Sep 05 '11 at 08:13
/bin/grep: PCRE does not support \L, \l, \N, \U, or \u – Oli Sep 05 '11 at 08:14
I'm using **GNU grep 2.6.3**, bundled in **Ubuntu 11.04** and it does, what's your version @Oli ? – albfan Sep 08 '11 at 10:54
gnu grep version 2.5.4, using the last ubuntu LTS (lucid, 10.04) – Oli Sep 16 '11 at 13:37
-zo was enough for my multi-line needs, thanks! (upvoted.) – Szocske Oct 18 '11 at 15:02
I needed to use "grep -PZo", capital z, on Gentoo (linux). – dfrankow Sep 05 '12 at 20:55
"As you can imagine, this search prints the main method in a C (*.c) source file." ... I dare to say it out loud : if you say so ... :) – Benjamin Delichere Dec 26 '12 at 13:29
Do you have a solution that works for big files that can't be read into memory? – tommy.carstensen Jul 03 '13 at 10:48
Oli, on GNU grep 2.14 (distributed with Ubuntu 13.04 x64), you CAN use -z and -P together; in fact you probably won't get what you want unless you use -Pzo (and start the regex with (?s)). Anyone who uses Perl regexes often should read (and keep a printout of) the manpage "perlreref", which is a great short text reference to many of the regular expressions you might need, a "cheat sheet" if you will... – osirisgothra Aug 15 '13 at 02:43
@albfan, my version of GNU grep (2.5.1) supports multiline search using `(?s)` but without the need to supply `-z` to identify NUL as the line terminator. Is there a specific reason you use `-z` here? – iruvar Oct 21 '13 at 15:44
`-P` worked for me without `-zo`. So it seems it works with one or the other, according to the comments. – sites Mar 03 '14 at 23:44
I recommend ''**grep -Pazo**'' instead of the unsafer ''-Pzo''. Explanation: the -z switch on non-ASCII files _may_ trigger grep's "binary data" behaviour which changes the return values. Switch ''-a | --text'' prevents that. – rloth Jan 08 '15 at 13:43
How would you loop through the results of these? – 170730350 Sep 21 '16 at 09:09
@Saichovsky not exactly loop, but you can pipe to less and move around: `grep --color=always | less -R`. `g` goes to the start, `G` goes to the end, `/` searches, `n` goes to the next result... – albfan Sep 21 '16 at 15:19
FYI, on OS X, the system default `grep` is BSD based and not the GNU version. As a result, PCRE (the `-P` switch) is sadly unsupported. – cavalcade Oct 31 '16 at 02:09
You mentioned that -z switch converts new lines into null characters. If you have to search near the new line, would you search for null characters or [[:space:]] or new line? – alpha_989 Sep 24 '17 at 00:13
+1 for showing the [embedded modifier trick](https://perldoc.perl.org/perlretut.html#Embedding-comments-and-modifiers-in-a-regular-expression), e.g., `(?s)`! – emallove Oct 19 '17 at 18:16
If you are looking for a multiline grep, use `pcregrep`, see [this answer](https://stackoverflow.com/questions/2686147/how-to-find-patterns-across-multiple-lines-using-grep). – jjmontes Mar 02 '18 at 11:46
GNU grep needs to be installed manually. https://apple.stackexchange.com/questions/193288/how-to-install-and-use-gnu-grep-in-osx – MutantMahesh Apr 19 '18 at 13:07
-z is not for `substituting newline with null char`, but `Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline` according to the manual. – osexp2000 Sep 26 '18 at 01:47
pipe the result to `tr '\0' '\n'` if you need the matches on separate lines! – t.animal Nov 26 '18 at 16:09
It might be worth noting that this is *still* experimental. That and of course it's not portable (though that doesn't have to matter). CentOS man page refers to it being 'highly experimental'; Fedora 29 says only 'experimental'. Would be nice though if it was standardised. – Pryftan Sep 23 '19 at 00:17
`-z`, aka. `--null-data` will output an extra NUL(\x00) character for each match. This might be an unwanted side-effect for some use cases. – youfu Jan 16 '20 at 03:33
@youfu, any idea on how to prevent adding an extra NUL(\x00) character at the end of a match? P.S. `$ grep --version`: `grep (GNU grep) 3.0 - Packaged by Cygwin (3.0-2)`. – pmor Jun 04 '20 at 20:43
As already noted, `-z` makes `grep` end its output with a `\x00`; other than `tr`anslating it, you may alternatively pipe through `head --bytes=-1` to truncate the null byte. – x a Mar 18 '21 at 11:06
Amazing! Thank you for posting. Btw for mac just use `brew install grep` then use this with `ggrep` . – phyatt Sep 03 '21 at 16:40
`grep -Pz '\x00' file.txt` returns exit code `1`. Answer says that `-z` option replaces newlines with null bytes. This is false. Otherwise this would return an exit code of `0` (success) for any file with more than one line. – Myridium Jan 14 '22 at 03:09
I've tried this with GNU grep 3.3 and it simply does not seem to work. `^` doesn't match the beginnings of lines if `-z` is passed. – 2rs2ts Jan 26 '22 at 19:20
use pcre2grep to avoid "grep: memory exhausted" – user1133275 Apr 19 '23 at 09:22
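
Several of the comments above concern the NUL byte that -z writes after each match. A minimal sketch of the tr approach from the comments, combined with the suggested -a; it assumes GNU grep with -P support, and the function name and the start/end markers are placeholders chosen for the demo:

```shell
#!/bin/sh
# Print every span of text from "start" to the nearest following "end".
# With -z and -o, grep terminates each match with a NUL byte; tr turns
# those NULs into newlines so the matches end up on separate lines.
# -a keeps grep from switching to its "binary file" behaviour.
print_blocks() {
    grep -Pazo '(?s)start.*?end' "$@" | tr '\0' '\n'
}

printf 'x\nstart\nA\nend\ny\nstart\nB\nend\n' | print_blocks
```

With no file operands the function reads standard input, as in the demo line; with operands it searches those files instead.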

score 214 · Answer 2 · edited Nov 21 '11 at 10:32

I am not very good with grep, but your problem can be solved with awk:

awk '/select/,/from/' *.sql

The above prints every block of lines that starts at a line matching select and ends at the next line matching from. You still need to check whether the returned statements contain customerName; for that, pipe the result into awk or grep again.
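
One way to fold that follow-up check into the same awk pass, offered as a sketch: collect the lines from select through from and test the collected text for customerName. The function name and the lower-casing are additions of mine, not part of the answer, and like the range pattern it works line by line with plain substring matching, so several statements on one line (or words such as "selected") can fool it.

```shell
#!/bin/sh
# Print the name of each file that has customerName somewhere between
# a line containing "select" and the next line containing "from".
# Matching is case-insensitive because every line is lower-cased first.
files_with_customer_select() {
    awk '
        FNR == 1 { buf = ""; inside = 0; found = 0 }  # new file: reset state
        found    { next }                             # already reported this file
        {
            line = tolower($0)
            if (!inside && line ~ /select/) { inside = 1; buf = "" }
            if (inside) {
                buf = buf " " line
                if (line ~ /from/) {
                    # End of the select..from block: test what was collected.
                    if (buf ~ /select.*customername.*from/) {
                        print FILENAME
                        found = 1
                    }
                    inside = 0
                }
            }
        }
    ' "$@"
}

# Demo on two throwaway files.
demo_dir=$(mktemp -d)
printf 'SELECT id,\n  customerName\nFROM customers;\n' > "$demo_dir/hit.sql"
printf 'select id\nfrom orders;\n' > "$demo_dir/miss.sql"
files_with_customer_select "$demo_dir/hit.sql" "$demo_dir/miss.sql"
rm -rf "$demo_dir"
```

Only hit.sql should be reported. Nothing here recurses into directories; the file list could come from find or a shell glob.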

edited Nov 21 '11 at 10:32

Nanne

answered Sep 15 '10 at 13:22

Amit

Awesome simple solution. Note: The comma is used as a separator in AWK _range pattern_. See full explanation in [section 7.1.3 Specifying Record Ranges with Patterns of AWK user guide](https://www.gnu.org/software/gawk/manual/gawk.html#Ranges) – Olivier Nov 21 '16 at 11:12
For completeness: this works with (simpler) sed too: `sed -n '/select/,/from/p' whatever.sql` – Joshua S Sep 05 '22 at 18:41

score 8 · Answer 3 · answered Sep 15 '10 at 13:11

Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.

Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.

I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.

In outline, for a single file:

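$/ = "\n\n";    # Read one 'paragraph' (text up to a blank line) at a time

while (<>)
{
     # /s lets . match newlines (/m would not); /i ignores case
     if (m/SELECT.*customerName.*FROM/si)
     {
         print "$ARGV\n";    # name of the file currently being read
         close(ARGV);        # skip to the next file on the command line
     }
}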

That needs to be wrapped into a sub that is then invoked by the methods of File::Find.

answered Sep 15 '10 at 13:11

Jonathan Leffler

Grep does not work one line at a time. It searches through the entire corpus for matches, and only when it finds a match does it go back to consider whether a newline is in the middle. That way, it doesn't have to scan through the corpus looking for new lines (which would slow it down significantly) – Squidly Nov 05 '13 at 13:28
@MrBones: there's a chance that a modern implementation of `grep` does as you say using `mmap()` to read the file into memory, but its mode of operation is defined by the POSIX specification for [`grep`](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html) and it decidedly works in terms of lines. I'm not convinced though; if the file is multiple gigabytes, there is no need to memory map it all when you can simply read in a few kilobytes at a time (most files with lines have lines that are less than kilobytes long). Then there's JSON files, of course, but they're exceptional. – Jonathan Leffler Nov 05 '13 at 14:02
It works in terms of lines, but it doesn't work one line at a time. There's not a loop doing some kind of `(for line in lines: doesMatch(line))`. It's more obvious when considering fgrep (fixed strings), and how boyer-moore works. mmap isn't really relevant – Squidly Nov 05 '13 at 14:08
that's a lot of problems on top of the original problem, plus the problem of the regex! – user3791372 Aug 08 '17 at 00:07
@Squidly Whether or not that's true does not change the fact that it considers a line at a time. How something is programmed doesn't equate to how it works does it? – Pryftan Sep 23 '19 at 00:31
Grep works on one line at a time. If you use -z then its "one line" is interpreted as the whole file since there aren't any NUL's in there for it to find to delineate. – rogerdpack Jan 20 '22 at 20:23
