2

I have exactly the same question as in this post ("RegExp exclusion, looking for a word not followed by another"), but the regex isn't working for me in bash.

I want to include all lines of a csv file that include the word "Tom", except when it's followed by "Thumb".

  • Include: Tom sat by the seashore.
  • Don't include: Tom Thumb sat by the seashore.
  • Include: Tom and Tom Thumb sat by the seashore.

The regex Tom(?!\s+Thumb) works when I try it out on regex101.com.

But I've tried all these variations and none of them work. What am I missing and how can I work around this? I'm on a Mac.

cat inputfile.csv | grep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | egrep “Tom(?!\s+Thumb)” > Tom.csv
cat inputfile.csv | grep -E Tom(?!\s+Thumb) > Tom.csv
cat inputfile.csv | grep -E “Tom(?!\s+Thumb)” > Tom.csv

Charles Duffy
gemma
  • Note that `grep -E` only guarantees ERE syntax. `\s` is PCRE; it may or may not work depending on which specific version of `grep` you're using. And `?!` isn't, to my knowledge, supported in _any_ ERE implementation at all. – Charles Duffy May 31 '21 at 21:44
  • (Some platforms have a `grep` that supports PCRE, but you'll need to check `man grep` on your specific target OS; typically, it's `grep -P` to enable the feature). – Charles Duffy May 31 '21 at 21:47
  • Also, note that `grep` is not part of bash -- it's a separate tool, built by a different team, compiled to a different executable. bash does have its own mechanism to access standard C library regex functionality, but when you use grep, you aren't using bash's regex support -- you're using grep's instead. – Charles Duffy May 31 '21 at 21:48
  • Also, note that the code copied/pasted into the question uses `“` and `”`. These are not recognized as valid quotes by bash. You **must** use only regular double quotes -- `"` -- for them to be recognized as shell syntax. If some software you use has "smart quotes" turned on, be sure to turn them off before using that program to edit shell scripts. – Charles Duffy May 31 '21 at 21:52
  • That said, the "Tom and Tom Thumb sat by the seashore" example indicates that you need a more powerful tool than `grep -E`. – Charles Duffy May 31 '21 at 21:57
  • (Note that just knowing that you're on a Mac doesn't tell us what version of `grep` you have, because Mac users can install their own versions of grep with tools like Nix, Macports, or Homebrew -- listed in my personal descending order of preference) – Charles Duffy May 31 '21 at 22:03
  • ...for example, once you've [installed Nix](https://nixos.org/manual/nix/stable/#sect-macos-installation), you can use `nix run nixpkgs.gnugrep -c grep -P ...` to use GNU grep for only a single command line, without changing your system-wide default. (Nix also provides mechanisms to change the software loadout used while working on a specific project; or for a specific user account; or so forth). – Charles Duffy May 31 '21 at 22:48

4 Answers

7

You can't do this with POSIX ERE.

There is no negative lookahead assertion in POSIX extended regular expressions, which is the syntax grep -E activates.

The closest you can get is to combine two separate regexes, one positive match and one negative:

grep -we 'Tom' inputfile.csv | grep -wvEe 'Tom[[:space:]]Thumb'

grep -v excludes any line that matches the given expression; so here, we're first searching for Tom, and then removing Tom Thumb.

However, the intent to match Tom and Tom Thumb sat by the seashore makes this unworkable. In short: You can't do what you're asking for with standard grep, unless it has grep -P to make your original syntax valid. In that case you could use:

grep -Pwe 'Tom(?!\s+Thumb)' <inputfile.csv >Tom.csv
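
Whether `-P` exists depends on how grep was built, so it may be worth probing for it before relying on it. A minimal sketch, assuming a POSIX shell; the variable name `has_pcre` is only illustrative:

```shell
# Probe whether the grep on PATH accepts -P (PCRE) by running a tiny
# lookahead pattern. A grep without PCRE support exits with an error,
# which lands in the else branch.
if printf 'Tom\n' | grep -P 'Tom(?!\s+Thumb)' >/dev/null 2>&1; then
  has_pcre=yes
else
  has_pcre=no
fi
echo "grep -P available: $has_pcre"
```

If this reports `no`, one of the workarounds that avoid lookahead is the safer route.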

One hack might be a temporary substitution

Assuming you have uuidgen available (it appears to be present in Big Sur) to generate a temporary, unpredictable sigil:

uuid=$(uuidgen)
sed -e "s/Tom Thumb/$uuid/g" <inputfile.csv \
  | grep -we 'Tom' \
  | sed -e "s/$uuid/Tom Thumb/g" >tom.csv
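
To check the pipeline in isolation, it can be run against the three sample lines from the question. This is only a sketch: the scratch file, the `result` variable, and the fallback placeholder used when `uuidgen` is missing are assumptions added for illustration.

```shell
# Write the three sample lines from the question to a scratch file.
tmp=$(mktemp)
printf '%s\n' \
  'Tom sat by the seashore.' \
  'Tom Thumb sat by the seashore.' \
  'Tom and Tom Thumb sat by the seashore.' >"$tmp"

# Use uuidgen when present; otherwise fall back to a fixed placeholder
# (assumption: this string never occurs in the data).
uuid=$(uuidgen 2>/dev/null) || uuid='PLACEHOLDER-3f9c2a7e'

# Hide "Tom Thumb", keep lines that still contain the word Tom, then restore.
result=$(sed -e "s/Tom Thumb/$uuid/g" <"$tmp" \
  | grep -we 'Tom' \
  | sed -e "s/$uuid/Tom Thumb/g")
rm -f "$tmp"

printf '%s\n' "$result"
```

The first and third sample lines should be printed, and the second dropped.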
Charles Duffy
  • The trouble is, this excludes "Tom and Tom Thumb sat by the seashore", which I want to keep. – gemma May 31 '21 at 21:57
  • Yes, I'm aware. See the extension, which was underway as you added the comment. – Charles Duffy May 31 '21 at 21:58
  • What version of `grep` do you have? Does it have a `grep -P` argument available? – Charles Duffy May 31 '21 at 21:58
  • (The version number isn't actually enough to know if it has a `-P` argument, insofar as that argument is an extension that's only enabled when GNU grep is compiled against libpcre, which is an optional library dependency; all versions that support it can be compiled either with the library -- and thus the option -- or without it). – Charles Duffy May 31 '21 at 22:01
  • Consider installing GNU grep with one of the package managers I suggested in a comment thread attached to the question. Or you can use the hack at the end of the answer with two `sed`s, one before and one after the `grep`. – Charles Duffy May 31 '21 at 22:44
  • Thanks. It's version 2.5.1. `-P` doesn't seem to be included. – gemma May 31 '21 at 22:54
  • I do have uuidgen available, and tried that method, but the output file is empty. Thanks, will try installing another version of grep. – gemma May 31 '21 at 22:56
  • Can you show me how you tried to deploy the `uuidgen` method, in enough detail that I can reproduce your problem? It works perfectly in the online interpreter at https://ideone.com/i1d1Cb – Charles Duffy May 31 '21 at 23:00
  • When you have a csv-file without special characters, you might use `\v`. In https://stackoverflow.com/a/50271921/3220113 I used a bunch of special characters; I think `\r` can occur in a csv file. `sed -e "s/Tom Thumb/\v/g" ...`. – Walter A Jun 01 '21 at 06:54
  • @WalterA, makes sense as a replacement for the uuid, but I don't believe uuidgen is actually responsible for the OP's problem -- we'd see different symptoms, not an empty output file, if it were. – Charles Duffy Jun 01 '21 at 11:05
  • @gemma, ... Honestly, if I had to guess what was wrong, my guess would be a topographic error around the line breaks (it's critical that there be no whitespace after each trailing backslash, for example). If you made it all one line like `uuid=$(uuidgen); sed -e "s/Tom Thumb/$uuid/g" <inputfile.csv | grep -we 'Tom' | sed -e "s/$uuid/Tom Thumb/g" >tom.csv`, does that work? – Charles Duffy Jun 01 '21 at 11:10
  • (ugh; meant `typographic`... how did I make that mistake without a phone's autocorrect? Granted, I do sometimes post from a phone, but the above comment is a pretty gnarly one to type that way). – Charles Duffy Aug 22 '21 at 03:43
2

How about a Perl solution:

perl -ne 'print if /Tom(?!\s+Thumb)/' inputfile.csv > Tom.csv

Perl supports lookahead assertions natively (PCRE is modeled on Perl's regex syntax) and comes pre-installed on macOS.

  • The -n option loops over the input line by line without printing each line automatically, much like sed -n.
  • The -e option passes the program text on the command line, which is what makes the one-liner possible.
  • print if /pattern/ is an idiom that prints only the lines matching the pattern, so it can stand in for the grep command.
tshiono
  • Thank you for the feedback. Good to know it works. If you feel my answer solves your problem, I'd appreciate it if you could accept it by clicking the check mark beside the answer. BR. – tshiono Jun 01 '21 at 01:43
1

Keep it simple and just use awk, e.g. using any awk in any shell on every Unix box:

$ awk '{orig=$0; gsub(/Tom Thumb/,"")} /Tom/{print orig}' file
Include: Tom sat by the seashore.
Include: Tom and Tom Thumb sat by the seashore.
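
The script saves each line, deletes every "Tom Thumb" from the working copy, and prints the saved original only if a "Tom" is still left. A sketch of the same idea that also tolerates runs of spaces or tabs between the two words; the sample input and the `result` variable are illustrative assumptions:

```shell
# Same approach as above, but [ \t]+ allows one or more spaces or tabs
# between "Tom" and "Thumb".
result=$(printf '%s\n' \
  'Tom sat by the seashore.' \
  'Tom   Thumb sat by the seashore.' \
  'Tom and Tom Thumb sat by the seashore.' \
  | awk '{ orig = $0; gsub(/Tom[ \t]+Thumb/, "") } /Tom/ { print orig }')
printf '%s\n' "$result"
```

The line with three spaces between the words is dropped, just as a single-space "Tom Thumb" line would be.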
Ed Morton
1

GNU grep can use Perl-compatible regular expressions (PCRE) via the -P option, provided it was built with PCRE support. From man grep:

-P, --perl-regexp

Interpret PATTERNS as Perl-compatible regular expressions (PCREs). This option is experimental when combined with the -z (--null-data) option, and grep -P may warn of unimplemented features.

Jimmy
  • This is the most direct answer to the OP's question; maybe with a little caveat about cross-distro compatibility. I mean you can just replace 'E' with 'P' in their latter examples – Rondo Nov 21 '21 at 04:07