
I want to match all directories that contain a word from a list AND the word test, but never the word DAT.

EB80
TF90
UI11
POSPO02

Therefore, the string is a match if any of the above patterns are in it and the word test is also in the string. But the string DAT should NEVER be anywhere in the match.

I have this regex but it does not seem to be working correctly:

EB80 | TF90 | UI11 | POSPO02 [^DAT]test$

find . -regextype sed -regex "EB80 | TF90 | UI11 | POSPO02 [^DAT]test$"
  • Are you looking for directories where the directory name itself matches these conditions, or are you including names of the directory's content? I assume at this point you mean the directory name itself. – Paul Hodges Jul 11 '18 at 13:41
  • Also, your attempt implies specific positioning of the fields. Could "test" be before the other items? Could "DAT"? – Paul Hodges Jul 11 '18 at 13:42
  • @PaulHodges Directories, sub-directories, sub-sub-directories, etc. – paropunam Jul 13 '18 at 14:08
  • @PaulHodges The order does not matter. IDs from the list can occur anywhere, so can the word `test`. – paropunam Jul 13 '18 at 14:14

3 Answers


Not particularly elegant but with basic find:

$ ls
DATtestTF90 EB80test    POSPO02test UI11

$ find . -name "*DAT*" -prune -o -name "*test*" \( -name "*EB80*" -o -name "*TF90*" -o -name "*UI11*" -o -name "*POSPO02*" \) -print
./POSPO02test
./EB80test

The arguments to find can be understood as:

-- If the name matches "*DAT*" stop! (-prune) and proceed no further (see also: What does -prune option in find do?)

-- Otherwise, (-o), if the name matches "*test*" AND the name contains any one of the given patterns, output the name (-print)

The parentheses work like you'd expect in a typical programming language. By default any two predicates have an AND relation, but this can be overridden with -o to give an OR relationship. The parens, in the words of the man page, are used to "Force precedence", again as I'm sure you're used to in other languages. Hence you can read the second part of the find as

name == "*test*" AND (name=="*EB80*" OR name=="*TF90*" OR name=="*UI11*" OR name=="*POSPO02*")

Note that because the parentheses have meaning for the shell, they need to be escaped so that find receives them intact.
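
If you'd rather not retype the IDs, one possible approach (a rough sketch, assuming bash and a hypothetical ids.txt file holding one ID per line) is to build the \( ... \) group in an array and hand it to find:

# ids.txt (hypothetical) holds one ID per line: EB80, TF90, UI11, POSPO02
args=( '(' )
first=1
while IFS= read -r id; do
    [ -n "$id" ] || continue                  # skip blank lines
    [ "$first" -eq 1 ] && first=0 || args+=( -o )
    args+=( -name "*${id}*" )
done < ids.txt
args+=( ')' )

find . -name "*DAT*" -prune -o -name "*test*" "${args[@]}" -print

Because the parentheses and -name patterns are stored as array elements, each one reaches find as a single argument without any extra quoting or escaping.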

jas
  • This seems to be doing exactly what I need. It is very fast and it matched all of the directories that had any of those IDs in their name and had `test` in the name as well. I do not exactly understand how it works; especially the part within `\( ... \)` is the most confusing section for me. But I would definitely accept this as the answer if you could expand on this as much as the other detailed answers, and perhaps add a programmatic way to automatically write the part within `\( ... \)` from a file without having to re-type all of those IDs in the `find` command. – paropunam Jul 13 '18 at 14:14
  • 1
    Added some more explanation, and in doing so I realized this might not meet your specifications. Is it possible there could be a subdirectory of a directory with "DAT" in the name that you do want to output? Imagine `DATtest1/DATtest2/EB80test`. In this solution, directories with DAT in the name and everything below them will be ignored! – jas Jul 13 '18 at 17:05
  • No that's not possible. Therefore, your solution does fulfill the task I require. But thanks a lot for mentioning that. I've already tested your solution extensively on many directories and it indeed produces the exact list I need. I am going to accept this as the correct answer. – paropunam Jul 16 '18 at 13:19

You can't express (a or b) and c and !d in a single regexp when those items are actually strings. Even if they were just single characters, trying to express it in one regexp would be a convoluted mess, if it were possible at all. Note that [^DAT] means not (D or A or T) - [] is a bracket expression and as such contains a set of characters, not strings.
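
For example, [^DAT]test happily matches a string that contains DAT, because all it requires is one character before test that is not D, A, or T:

$ echo DAT-test | grep '[^DAT]test'
DAT-test

The '-' satisfies the bracket expression even though DAT sits right in front of it.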

You should consider post-processing the find output with awk to match the condition you care about. It'd simply be:

find . -type d -print |
awk '/EB80|TF90|UI11|POSPO02/ && /test/ && !/DAT/'

because it's trivial to write what you need as a condition, but not as a single regexp. If your file names can contain newlines, then with GNU find and GNU awk just use NUL as the file name terminator instead of newline:

find . -type d -print0 |
awk -v RS='\0' '/EB80|TF90|UI11|POSPO02/ && /test/ && !/DAT/'

Obviously you can move some of the condition into the find and take it out of the awk if you care about efficiency, but you might find it easier to maintain if you have your whole condition in one place like above.
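
As a sketch of that kind of split (note this moves only the test check into find; -name tests the directory's own name while awk still sees the whole path, so the results only line up if test has to appear in the final path component):

find . -type d -name '*test*' -print |
awk '/EB80|TF90|UI11|POSPO02/ && !/DAT/'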

Ed Morton

Some people will argue that I'm spawning too many procs, but sometimes readability matters too, and since you didn't explicitly say one way or the other, I'm going to assume that the order of these strings isn't relevant. How about -

find . -type d -name \*test\* | 
  grep -v DAT | egrep "EB80|TF90|UI11|POSPO02"

A quick test -

$: mkdir footestbar
$: mkdir footestbarDAT
$: mkdir footestbarDATEB80
$: mkdir footestbarEB80
$: find . -type d -name \*test\* |
>       grep -v DAT | egrep "EB80|TF90|UI11|POSPO02"
./footestbarEB80
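
If retyping the ID list is a concern, grep can also read its patterns from a file with -f (ids.txt here is a hypothetical file with one ID per line), and grep -E is the non-deprecated spelling of egrep:

find . -type d -name \*test\* |
  grep -v DAT | grep -E -f ids.txt
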
Paul Hodges
  • `egrep` is deprecated and you should use `grep -E` instead, but you should also consider just using `awk '/EB80|TF90|UI11|POSPO02/ && !/DAT/'` instead of 2 greps and a pipe anyway. – Ed Morton Jul 12 '18 at 00:52
  • I always consider it. I despise awk. I'll write a whole program in Perl before I knock off a one-liner in awk - but that's just irrational personal prejudice on my part. I don't argue that it's likely the smarter solution in a lot of cases, this being one. – Paul Hodges Jul 12 '18 at 13:23
  • 1
    Good to hear you know it's irrational :-). If at some point you'd like to address the reasons why you despise it, feel free to post a question and I'll be happy to respond - **maybe** you suffered early exposure to some nightmarish code and it's not actually as bad as you think! I avoided awk for the first 10 years of of my career due to seeing horrendous scripts and only having old, broken awk available. It took **having** to maintain someone elses huge awk script I inherited to get me over the hump and start appreciating it. – Ed Morton Jul 12 '18 at 13:32
  • Wish I'd seen more examples like the one you used in your comment here. That's simple and reasonable and useful as hell...but no, awk was just never something I needed for the first ten years or so of working in `*NIX`, and when it was I just used Perl, which I was already using on a daily basis anyway, and which can do all the same stuff. I just didn't want to have to pollute my brain with *TWO* baroque versions of obscure, metacharacter-heavy tools when one would handle it. For something those guys initially wrote over a weekend, it's an amazing tool that has stood the test of time. ;) – Paul Hodges Jul 12 '18 at 13:41
  • For most of my 35+ year career perl simply wasn't available on the machines we used so we had to stick with standard UNIX tools and when I finally got over the hump from sed+grep+shell+etc. to awk it was a breath of fresh air. So far I haven't come across anything I need to use perl for and to be honest I find the idiomatic syntax people post here and in other forums horrifying but I can certainly understand the desire to only learn 1 tool if it does everything you need. – Ed Morton Jul 12 '18 at 13:47
  • 1
    Yeah - Perl is certainly a mindset of its own, lol -- my language of choice when a shell just isn't quite enough, but not for everybody. :D – Paul Hodges Jul 12 '18 at 13:52