
I have a plain text file with one string per line. I'd like to identify any instances where a string contains a character outside of a restricted character set. In this particular instance, if the string contains any character outside of the set "[THADGRC.SMBN-WVKY]", I want to retain it and pass it along to a new file.

For example, let's say the original file "mystrings.txt" contained the following data:

THADGRC.SMBN-WVKY
YKVW-NBMS.CRGDHAT
THADGRC.SMBN-WVKYI

My intention is to retain only the third sequence, because it contains a character outside of the allowed set (I, in this case).

It doesn't matter how many times, or in what order, an allowed character is present - all I care about is if a character exists in that string outside of the allowed set.

Originally I tried:

cat mystrings.txt | grep -v [THADGRC\.SMBN-WVKY] > badstrings.txt

but of course the third string contains those allowed characters in addition to the non-allowed character, so this search ended up producing no "offending" strings.

Last thing: I'm not sure what characters outside of the allowed set might exist in this text file. If I knew in advance, I could just search for anything containing an "I", but I don't.

So the question: is there a way to use grep (or another tool, say awk?) to pass in a restricted list of characters, and flag any instances where a string contains any number of characters outside of that set?

Thanks for your consideration

Devon O'Rourke
  • Does this answer your question? [How do you escape a hyphen as character range in a POSIX regex](https://stackoverflow.com/questions/28495913/how-do-you-escape-a-hyphen-as-character-range-in-a-posix-regex) – Wiktor Stribiżew Jul 08 '20 at 13:44

2 Answers


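I think that part of your problem is N-W. This doesn't match "N", "-" and "W"; it matches a range from "N" to "W". You should move "-" to the end (or the start) of the character class; a backslash does not escape it inside a POSIX bracket expression. You also need to drop -v and negate the class with a leading ^, so that grep prints the lines containing at least one character outside the set. I suggest changing to: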

grep '[^THADGRC.SMBNWVKY-]' mystrings.txt

Also, note that "." doesn't have to be escaped when it's inside a character class.
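
To see the difference on the sample data from the question, here is a minimal sketch. The scratch directory and the variable names (tmpdir, bad, range_hits) are only for illustration, and LC_ALL=C is my assumption to pin the range N-W to plain ASCII order:

```shell
# Recreate the sample file from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'THADGRC.SMBN-WVKY' 'YKVW-NBMS.CRGDHAT' 'THADGRC.SMBN-WVKYI' > "$tmpdir/mystrings.txt"

# Dash last: it is a literal "-", so only the line containing "I" is reported.
bad=$(grep '[^THADGRC.SMBNWVKY-]' "$tmpdir/mystrings.txt")
printf 'dash last: %s\n' "$bad"

# Dash in the middle: N-W is a range, so a literal "-" falls outside the set
# and every sample line is counted as offending.
range_hits=$(LC_ALL=C grep -c '[^THADGRC.SMBN-WVKY]' "$tmpdir/mystrings.txt")
printf 'lines flagged with N-W as a range: %s\n' "$range_hits"

rm -rf "$tmpdir"
```

With the dash last, only THADGRC.SMBN-WVKYI is printed; with the dash between N and W, all three lines are flagged because each one contains a "-".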

Maroun
  • Yeah, thanks very much. From man grep, under "Character Classes and Bracket Expressions": "A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list; if the first character of the list is the caret ^ then it matches any character not in the list. For example, the regular expression [0123456789] matches any single digit." Testing this out now and hoping it works. Appreciate it. – Devon O'Rourke Jul 08 '20 at 12:52
  • If indeed the dash should be matched literally, it needs to be the first or the last character in the set. – tripleee Jul 08 '20 at 12:53
  • There's no reason to use double quotes around that bracket expression as it doesn't contain anything you need the shell to expand. Just use the default single quotes. – Ed Morton Jul 08 '20 at 21:10

Your attempt says "remove any lines which contain one of these characters at least once". But you want "print any lines which contain at least one character not in this set."

(Also, quote your regular expressions, and lose the useless cat.)

grep '[^-THADGRC.SMBNWVKY]' mystrings.txt > badstrings.txt

I moved the dash to the beginning of the character class on the assumption that you want a literal dash, not the regex range N-W (i.e. N, O, P, Q, R, S, T, U, V, W).
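
Since the question also mentions awk and not knowing in advance which stray characters occur, here is a hedged sketch of both. It assumes a grep that supports -o (GNU and BSD grep do; POSIX does not require it), and the scratch file and variable names (tmpdir, awk_bad, offenders) are only for illustration:

```shell
# Recreate the sample file from the question in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'THADGRC.SMBN-WVKY' 'YKVW-NBMS.CRGDHAT' 'THADGRC.SMBN-WVKYI' > "$tmpdir/mystrings.txt"

# awk equivalent: print every line that matches the negated bracket expression.
awk_bad=$(awk '/[^-THADGRC.SMBNWVKY]/' "$tmpdir/mystrings.txt")

# List each distinct character outside the allowed set:
# -o prints every match on its own line, sort -u removes duplicates.
offenders=$(grep -o '[^-THADGRC.SMBNWVKY]' "$tmpdir/mystrings.txt" | sort -u)

printf 'offending lines: %s\n' "$awk_bad"
printf 'offending characters: %s\n' "$offenders"

rm -rf "$tmpdir"
```

On the sample data this reports the third line and the single character I. On a real file, the second command gives you the full list of unexpected characters without having to guess them first.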

tripleee