bash: constructing complex regexp for find

Question

While cleaning up old backups, I found signs that something had gone wrong: one user's backup contains files with strange names like "37&@4ez98d". To automate the cleanup, I tried to find all such files with this regexp:

find -regextype sed -regex '.*\/[[:digit:]a-z[:punct:]]\{10\}'

All of these names are exactly 10 characters long and contain digits, lowercase Latin letters, and some punctuation. The find worked almost perfectly, but it also matched some files with "legal" names like 07-709.pdf. I cannot construct a regexp that says "anywhere inside the given subtree, 10 characters consisting of digits, lowercase Latin letters, and SOME punctuation, but not the dot or the minus sign".

I tried everything I could think of, but I could not make find ignore minuses and dots. These symbols may appear anywhere inside the file name, so I can't rely on a fixed position. Adding something like [^.] (in any variation) produced no usable results. Grepping find's results for dots and minuses is also useless, because these symbols may occur in directory names, and filtering on them could also filter out the "bad" file names. I can not enumerate all punctuations possible because I can miss something: I have no idea what "alphabet" was used to scramble these names, though I'm pretty sure that it contains neither dots nor minuses.

I managed to work around the problem by piping find's output through an additional check (it was a one-liner; the newlines were inserted for readability only):

find -regextype sed -regex '.*\/[[:digit:]a-z[:punct:]]\{10\}' | \
while read a; do \
b=${a: -10}; [[ ! "$b" =~ .*[\-\.]+.* ]] && echo $b; \
done

but what I really need is a single regexp.

Any suggestions, please?

Some real data for testing (the first four should be found, the last three should be ignored):

rxoxywiy7l
u29t@5%0qd
im^ua&saeo
y6mxn2wnkb
07-709.pdf
3023-7.pdf
18099.docx

Thank you.

You can just use two `-regex` tests, `find` will "and" them together (i.e. only show files matching both): `find ... -regex 'Exactly10CharsPattern' -regex 'AtLeastOneWeirdChar' ...` – Gordon Davisson, Nov 23 '22 at 05:31
@GordonDavisson I tried: ...{10\}' -regex '^[\.\-]' and even ...{10\}' -regex '^a'. Both returned nothing at all. ...'![\.\-]' and ...'[!\-\.] did the same: no results. – Troublemaker-DV, Nov 23 '22 at 05:50
You sure you wanna find no. 4, `y6mxn2wnkb`? That's a totally legit file name. Perhaps you are looking for things with extensions? One thing is that you should use the information of a length of exactly 10. – Peter - Reinstate Monica, Nov 23 '22 at 06:50
@Peter-ReinstateMonica I need to find all of the first four names and none of the last three. The name is legit, yes, but not in this case: such files can't appear in the directory subtree where the user places his documents. No one in their right mind would name a document "np%1k'cph&" (one of the names found). In total, I found 85 10-character file names in this user's subtree, of which 81 were "scrambled". – Troublemaker-DV, Nov 23 '22 at 06:55
The thing I wanted to point out is that your criteria are a bit fuzzy and there's likely no perfect solution. Yes, all files that are exactly 10 characters long and don't have a file extension (defined by a dot and a minimum of 1 and a maximum of 4 letters before the end) and contain "weird" punctuation: We are sure those are bad. All that have a file extension and consist only of numbers and hyphens in the base name: Those are good. But can a "good" file also contain letters? (Your regex indicates that.) No numbers at all? No file extension? – Peter - Reinstate Monica, Nov 23 '22 at 07:27
I don't think, regextype _sed_ can do `[:punct:]`. Try `-regextype posix-extended` or `-regextype egrep`. – user1934428, Nov 23 '22 at 07:29
Bottom line: If you cannot afford to throw away the occasional "good" file I'd define a restrictive pattern to match only those bad ones we are 100% sure about: Those that match all criteria. Those that meet *some* criteria may need to be manually checked. – Peter - Reinstate Monica, Nov 23 '22 at 07:30
Also, the backslashes in `\{10\}` look wrong to me. You do **not** want to regex-escape the curly braces. – user1934428, Nov 23 '22 at 07:30
The regex worked fine, the problem was the wrong approach, see the marked answer @user1934428 – Troublemaker-DV, Nov 24 '22 at 03:33
I think `posix-extended` would also have worked. Did you try it? It's simpler than explicitly writing the punctuation characters (as in the accepted answer). – user1934428, Nov 24 '22 at 07:27
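
A note on the two-test suggestion in these comments: GNU find's -regex has to match the whole path rather than some part of it, so a second pattern such as '^a' can only match a path that consists of the single character "a". A minimal sketch of the idea, assuming GNU find (for -regextype) and a scratch directory from mktemp; the directory name user.docs-2022 is made up for illustration, and the negated pattern is just one possible way to say "a dot or minus somewhere in the last path component":

```shell
# Sketch only: requires GNU find for -regextype.
set -eu

demo_dir=$(mktemp -d)                 # scratch directory, removed at the end
mkdir "$demo_dir/user.docs-2022"      # hypothetical name with a dot and a minus

# Sample names from the question.
for name in 'rxoxywiy7l' 'u29t@5%0qd' 'im^ua&saeo' 'y6mxn2wnkb' \
            '07-709.pdf' '3023-7.pdf' '18099.docx'; do
    : > "$demo_dir/user.docs-2022/$name"
done

# Both patterns describe the complete path. The negated one reads: anything,
# then a dot or minus, then only non-slash characters up to the end.
found=$(find "$demo_dir" -type f -regextype sed \
            -regex '.*/[[:digit:]a-z[:punct:]]\{10\}' \
            ! -regex '.*[.-][^/]*' \
            -exec basename {} \; | LC_ALL=C sort)

printf '%s\n' "$found"
# im^ua&saeo
# rxoxywiy7l
# u29t@5%0qd
# y6mxn2wnkb

rm -rf "$demo_dir"
```

Because the negated pattern only allows non-slash characters after the dot or minus, punctuation in directory names does not affect the result.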

tripleee · Accepted Answer · 2022-11-23T06:46:14.233

If you are not happy with the semantics of [:punct:] you need to spell out which punctuation characters exactly you want to match.

Quick Duck Duck Going gets me [][!"#$%&'()*+,./:;<=>?@\^_`{|}~-] for the full character class, so excluding dot and minus, try

find -regextype sed -regex '.*\/[][!"#$%&'"'"'()*+,/:;<=>?@\^_`{|}~[:digit:]a-z]\{10\}'

(I had to move the punctuation to the front for simplicity, and break out the single quote into a separate double-quoted string).

As an aside, piping find output to while read is prone to some complications; probably prefer -exec basename -a {} + (where basename supports -a, as GNU coreutils does; plain POSIX basename takes only one path, so use -exec basename {} \; there) or something similar to print the file names. (GNU find also has a -printf operator with a rich set of format codes.) See also https://mywiki.wooledge.org/BashFAQ/020
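
One way to avoid the pipe altogether is to let find hand the paths to a small inline shell script. The following is only a sketch under assumptions: the scratch directory and the extra file names ("two words!" and "back\slash", both 10 characters long) are invented to show names that a plain while read loop would split or mangle.

```shell
# Sketch only: names with a space or a backslash survive because find
# passes each path as a separate argument instead of a line of text.
set -eu

demo_dir=$(mktemp -d)   # scratch directory, removed at the end

for name in 'im^ua&saeo' 'two words!' 'back\slash' '07-709.pdf'; do
    : > "$demo_dir/$name"
done

# The inline script receives the paths as "$@"; ${path##*/} strips the
# directory part, so no separate basename call is needed.
found=$(find "$demo_dir" -type f -name '??????????' ! -name '*[.-]*' \
            -exec sh -c 'for path do printf "<%s>\n" "${path##*/}"; done' sh {} + |
        LC_ALL=C sort)

printf '%s\n' "$found"
# <back\slash>
# <im^ua&saeo>
# <two words!>

rm -rf "$demo_dir"
```

The angle brackets are only there to make the exact name boundaries visible in the output.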

As for grepping the results from find, you can easily anchor the regex to anything after the last slash.

find -type f -name '??????????' |
grep '/[^/.-]*$'

(again subject to the various caveats of the FAQ I linked above) ... though as @oguzismail notes, this can be simplified to just

find -type f -name '??????????' ! -name '*[.-]*'

or even

find -type f -name '[!.-][!.-][!.-][!.-][!.-][!.-][!.-][!.-][!.-][!.-]'
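
As a quick check, the -name variants can be tried against the sample names from the question. This is a sketch under assumptions: the scratch directory comes from mktemp, and the subdirectory name user.docs-2022 is made up to show that dots and minuses in directory names are harmless, because -name only looks at the last path component.

```shell
# Sketch only: rebuild the question's sample names in a throwaway tree.
set -eu

demo_dir=$(mktemp -d)                 # scratch directory, removed at the end
mkdir "$demo_dir/user.docs-2022"      # hypothetical name with a dot and a minus

# The first four names are "scrambled", the last three are real documents.
for name in 'rxoxywiy7l' 'u29t@5%0qd' 'im^ua&saeo' 'y6mxn2wnkb' \
            '07-709.pdf' '3023-7.pdf' '18099.docx'; do
    : > "$demo_dir/user.docs-2022/$name"
done

# Exactly 10 characters, none of which is a dot or a minus.
found=$(find "$demo_dir" -type f -name '??????????' ! -name '*[.-]*' \
            -exec basename {} \; | LC_ALL=C sort)

printf '%s\n' "$found"
# im^ua&saeo
# rxoxywiy7l
# u29t@5%0qd
# y6mxn2wnkb

rm -rf "$demo_dir"
```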

If you wanted to go all-in, you could use the Unicode database to extract all characters which count as punctuation; this is still in theory subject to the whims of the locale of the process which generated these file names (which you can't know) but probably in practice quite sufficient. If your find supports a -regextype which implements Perl / PCRE semantics, you could even use the Perl Unicode property escape \p{Po} (but alas, it apparently doesn't). Here's a list but notice also the various other punctuation classes on the category page.

edited Nov 23 '22 at 06:46

answered Nov 23 '22 at 05:16


Some regex variants let you remove things from a character class but `sed` certainly does not belong to them. For curiosity, see e.g. https://stackoverflow.com/questions/17327765/exclude-characters-from-a-character-class – tripleee Nov 23 '22 at 05:17
Alas. Quote: "I can not enumerate all punctuations possible because I can miss something: I have no idea what "alphabet" was used to scramble these names". Also, some characters cannot appear in the file names in question because the backup folder is located on an NTFS volume on W...dows Server 2k8, with all its naming limitations. That's why I tried to exclude only two permitted but unwanted symbols. – Troublemaker-DV Nov 23 '22 at 05:51
I'm afraid that isn't really solvable with the information you have supplied. The precise definition of the `[[:punct:]]` class depends on your current locale anyway, not on whatever other process on possibly another system used at a previous point in time. – tripleee Nov 23 '22 at 06:02
As to the "curiosity" link, I had read it before asking my question. I found that article not applicable to my case. Am I wrong? – Troublemaker-DV Nov 23 '22 at 06:04
(I'm afraid...) Agreed. And that's what I wrote: I don't care about any punctuation other than those two symbols. As for the echo in my workaround, that was only an example; I could do whatever I need with the matching file. But I must be sure that the file is one I am looking for. – Troublemaker-DV Nov 23 '22 at 06:07
`read` without `-r` will mangle backslashes. If the path contains spaces or shell metacharacters (or newlines, of course) the FAQ still applies. – tripleee Nov 23 '22 at 06:07
grep isn't necessary, `find -type f -name '??????????' ! -name '*[.-]*'` does the same. – oguz ismail Nov 23 '22 at 06:26
Shame on my grey head! :-) Thank you both, @oguzismail & tripleee. I forgot about question marks as character placeholders in file names, although I used them a lot when writing .bat scripts about 20 years ago. – Troublemaker-DV Nov 23 '22 at 06:43
That's why I love *nix and bash: for their flexible set of data-processing tools, and for the "Unix way": "a single utility must perform one task, but it must perform it perfectly". – Troublemaker-DV Nov 23 '22 at 06:50
