I'm currently writing a simple Bash script. The idea is to use grep to find the lines in some files where a certain pattern occurs. The pattern consists of 3 capital letters followed by 6 digits, so the regex is [A-Z]{3}[0-9]{6}.

However, I need to include only the lines where this pattern is not concatenated with other strings; in other words, when the pattern is found, it has to be separated from other strings by spaces.

So if the string which matches the pattern is ABC123456 for example, the line something ABC123456 something should be fine, but somethingABC123456something should fail.

I've extended my regex using the [:space:] character class, like so:

[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]

And this seems to work, except for when the string which matches the pattern is the first or last one in the line.

So, the line something ABC123456 something will match correctly;

The line ABC123456 something won't;

And the line something ABC123456 won't either.

I believe this has something to do with [:space:] not counting new lines and carriage returns as whitespace characters, even though it should from my understanding. Could anyone spot if I'm doing something wrong here?

Charles Duffy
lebchik
  • Note that you are _not_ asking about your actual problem `I need to only include the lines where this pattern is not concatenated with other strings`, you are specifically asking about `[:space:]` in Bash. – KamilCuk Mar 19 '22 at 18:04
  • Is the string a line from a file or the whole contents of a file? If it's a line, there won't be any leading/trailing newlines. – glenn jackman Mar 19 '22 at 18:06
  • `grep` doesn't care which shell you are using, and Bash can't control how `grep` understands its pattern argument. – tripleee Mar 19 '22 at 19:48
  • `grep` only processes a line at a time. The separators between those lines are never present in its buffer at all; it only evaluates your expression _against each line_, one at a time. – Charles Duffy Mar 19 '22 at 20:01
  • Personally, I often use `(^|[[:space:]])` and `($|[[:space:]])`. – Charles Duffy Mar 19 '22 at 20:02
  • (and as was said: this has _nothing whatsoever_ to do with bash; bash doesn't provide grep or control its behavior; your question hinges on grep behavior, not on anything specific to bash) – Charles Duffy Mar 19 '22 at 20:03
  • Thank you for the answers. I apologize about the question not being about bash in specific, I wasn't sure what else to put it under as it marginally involved both bash and `grep`. The solution to my particular issue was to use word boundary `\b` in my regex. – lebchik Mar 20 '22 at 16:25

1 Answer

A common solution to your problem is to normalize the input so that there is a space before and after each word.

sed 's/^/ /;s/$/ /' file |
grep -oE '[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]'

Your question assumes that the newlines are part of what grep sees, but that is not true (or at least not how grep is commonly implemented). Instead, it reads just the contents of each new line into a memory buffer, and then applies the regular expression to that buffer.
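As a quick check of that claim, the sketch below feeds grep a single line with no embedded whitespace and counts matching lines. The sample input is made up for illustration, and `|| true` is only there because `grep -c` exits non-zero when the count is zero.

```shell
# The input line ends in a newline, but grep removes it before matching,
# so a line without embedded whitespace never matches [[:space:]].
with_space=$(printf 'ABC123456\n' | grep -c '[[:space:]]' || true)

# The line anchors ^ and $ still match at the edges of the line contents.
anchored=$(printf 'ABC123456\n' | grep -cE '^[A-Z]{3}[0-9]{6}$' || true)

echo "lines containing whitespace: $with_space"
echo "lines matching with anchors: $anchored"
```

The first count is 0 and the second is 1, which is why the end of a line has to be expressed with `$` rather than with a whitespace class.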

A similar but different solution is to specify beginning of line or space, and correspondingly space or end of line:

grep -oE '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)' file

but this might not be entirely portable.
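To see the difference between the two patterns, here is a small sketch that runs both against the example lines from the question. The variable names `sample`, `strict`, and `anchored` are only for illustration, and the counts assume a grep that supports `-E` with this alternation (GNU grep does).

```shell
# Example lines from the question; the last one must not match.
sample='something ABC123456 something
ABC123456 something
something ABC123456
somethingABC123456something'

# Original attempt: whitespace required on both sides of the match.
strict=$(printf '%s\n' "$sample" |
  grep -cE '[[:space:]][A-Z]{3}[0-9]{6}[[:space:]]' || true)

# Line anchors accepted as alternatives to whitespace.
anchored=$(printf '%s\n' "$sample" |
  grep -cE '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)' || true)

echo "strict: $strict, anchored: $anchored"
```

The strict pattern counts only the first line, while the anchored pattern counts the first three and still rejects the concatenated one.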

You might want to postprocess the results to trim any spaces from the extracted strings, too; but I have already had to guess several things about what you are actually trying to accomplish, so I'll stop here.
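One possible way to do that trimming, sketched under the assumption that you only want the bare identifiers: pipe the `-o` output through sed and strip leading and trailing whitespace. The sample input and the `ids` variable are made up for illustration.

```shell
# -o prints the whole match, including the delimiting whitespace,
# so a second step strips whitespace from both ends of each result.
ids=$(printf '%s\n' 'ABC123456 something' 'something XYZ000001' |
  grep -oE '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)' |
  sed 's/^[[:space:]]*//;s/[[:space:]]*$//')

printf '%s\n' "$ids"
```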

(Of course, sed can do everything grep can do, and then some, so perhaps switch to sed or Awk entirely rather than build an elaborate normalization pipeline around grep.)
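As a rough sketch of the Awk route: Awk already splits each line on whitespace, so it is enough to test each field against an anchored pattern. The function name `find_ids` is hypothetical, and the repetition is spelled out rather than written as `{3}` and `{6}` because some older Awk implementations do not support interval expressions.

```shell
# Hypothetical helper: print each line that has at least one
# whitespace-delimited field consisting of exactly 3 capitals + 6 digits.
# Reads the files given as arguments, or standard input if there are none.
find_ids() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^[A-Z][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9][0-9]$/) {
        print
        next   # one matching field is enough; move on to the next line
      }
  }' "$@"
}

matches=$(printf '%s\n' \
  'something ABC123456 something' \
  'ABC123456 something' \
  'something ABC123456' \
  'somethingABC123456something' | find_ids)

printf '%s\n' "$matches"
```

Because the test is applied per field, the start and end of the line need no special handling here.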

tripleee
  • Thank you. This is useful and I'll keep it in mind for future references. However, the solution to my issue was simply using the word boundary `\b` in my regex, as I stated in a previous comment. – lebchik Mar 20 '22 at 16:27
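
For comparison, the `\b` approach mentioned in the comments is not quite the same test as requiring whitespace: a word boundary also occurs next to punctuation. The sketch below assumes a grep that supports `\b` (it is an extension, available in GNU grep), and the sample lines and variable names are made up for illustration.

```shell
sample='something ABC123456 something
id=ABC123456;
somethingABC123456something'

# Word boundaries: accepts any non-word character (or line edge) as a delimiter.
by_boundary=$(printf '%s\n' "$sample" |
  grep -cE '\b[A-Z]{3}[0-9]{6}\b' || true)

# Whitespace or line edge only, as in the answer above.
by_space=$(printf '%s\n' "$sample" |
  grep -cE '(^|[[:space:]])[A-Z]{3}[0-9]{6}([[:space:]]|$)' || true)

echo "word boundary: $by_boundary, whitespace only: $by_space"
```

With `\b`, the `id=ABC123456;` line also counts as a match, so which pattern fits depends on whether punctuation should be treated as a separator.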