2

I would like to match the following expression in bash:

^.*(\b((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))\b).*$

Really all I want to know is whether one of the words of the string tested is one of the words described in this regex (720p, 1080p, brrip, ...). And there seems to be an issue with the word boundaries.

The test I use is [[ $name =~ $re ]] && echo "yes" where $name is any string and $re is my regular expression.

What am I missing?

codeforester
Sheraff
  • Single-quotes -- as in `re='yadda yadda yadda'` -- will not interfere destructively with your backslashes. – Charles Duffy Dec 15 '14 at 02:28
  • 1
    I don't understand the downvotes: the first line of the accepted answer makes explicit why this needed answering. \b is a PCRE extension; it isn't available in ERE, which the =~ operator in bash's [[ ]] syntax uses. – Sheraff Dec 20 '14 at 12:47
  • To add to this, `Bash-3.0` to `Bash-3.1` used PCRE syntax for sure, which can be enabled in `Bash-4.0` and later versions using `shopt -s compat31`. – Samveen Jul 13 '16 at 04:34
  • Ignore my previous comment. Looks like the accepted answer caused me confusion. Fixing it with an answer. – Samveen Jul 13 '16 at 04:58
  • @Samveen, eh? It's not PCRE syntax as such that's throwing you off, but vendor extensions to ERE adding a (single, specific) feature that originated in PCRE. Bash uses the local operating system's libc, so it implicitly picks up all extensions your OS vendor chooses to provide. – Charles Duffy Jul 13 '16 at 15:01
  • @Samveen you can delete your own comments. – styrofoam fly Jul 04 '18 at 22:29

2 Answers

5

\b is a PCRE extension; it isn't available in POSIX ERE (Extended Regular Expressions), which is the smallest possible set of syntax that the =~ operator in bash's [[ ]] will honor. (An individual operating system may have a libc which extends this syntax; in this case those extensions will be available on such operating systems, but not on all platforms where bash is supported).

As a baseline, the \b extension doesn't actually have very much expressive power -- you can write any PCRE that uses it as an equivalent ERE. Better, though, is to step back and question the underlying assumptions: When you say "word boundary", what do you really mean? If all you care about is that the match starts and ends either at whitespace or at the beginning or end of the string, then you don't need the \b operator at all:

(^|[[:space:]])((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))($|[[:space:]])

Note that I took out the initial ^.* and ending .*$, since those constructs are self-negating when doing an otherwise-unanchored match; the .* makes the ^ that immediately precedes it meaningless, and likewise the .* just before the final $.
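
As a rough sketch of how the whitespace-delimited pattern above might be used (the names `words`, `re`, and `has_release_tag` are my own, not from the question, and the alternation is the question's with redundant parentheses dropped):

```bash
#!/usr/bin/env bash
# Alternation from the question, minus the redundant grouping.
words='720p|1080p|(br|hd|bd|web|dvd)rip|(x|h)264|DVDscr|xvid|hdtv|ac3|s[0-9]{2}e[0-9]{2}|avi|mp4|mkv|eztv|YIFY'

# A "word" here must be delimited by whitespace or by the string's ends.
re="(^|[[:space:]])($words)(\$|[[:space:]])"

# Hypothetical helper: succeeds if any whitespace-delimited word of $1 matches.
has_release_tag() {
  [[ $1 =~ $re ]]
}

for name in 'Some Show s01e02 hdtv' 'hdtvish thing' 'plain title'; do
  if has_release_tag "$name"; then
    echo "yes: $name"
  else
    echo "no: $name"
  fi
done
```

Only the first name should report yes; `hdtvish` is rejected because `hdtv` is followed by a letter rather than whitespace or the end of the string.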


Now, if you want an exact equivalent to \b when placed immediately before a word character at the beginning of a sequence, then we get something more like:

(^|[^a-zA-Z0-9_])

...and, likewise, when immediately after a word character at the end of a sequence:

($|[^a-zA-Z0-9_])

Both of these are somewhat degenerate cases -- there are other situations where emulating the behavior of \b in ERE can be more complicated -- but they're the only situations your question appears to present.

Note that some implementations of \b would have better support for non-ASCII character sets, and thus be better described with [^[:alnum:]_] rather than [^a-zA-Z0-9_], but it's not well-defined here which implementation you're coming from or comparing against.
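
A minimal sketch of that emulation, assuming the `[^[:alnum:]_]` variant and a hypothetical helper named `has_word` (not something from the question):

```bash
#!/usr/bin/env bash
# has_word STRING WORD_ERE
# Succeeds if WORD_ERE matches in STRING with no word character
# ([[:alnum:]] or _) directly before or after it.
has_word() {
  local re="(^|[^[:alnum:]_])($2)(\$|[^[:alnum:]_])"
  [[ $1 =~ $re ]]
}

# Dots count as boundaries here, unlike with the [[:space:]] version.
if has_word 'Movie.2014.720p.BRRip.x264' '720p|x264'; then
  # Group 2 holds the word itself, without the delimiter characters.
  echo "matched: ${BASH_REMATCH[2]}"
fi

# x264 is followed by a digit, so this is not a whole-word match.
has_word 'Movie.2014.x2640' 'x264' || echo "no match"
```

Because the delimiter groups consume a character, the overall match (`${BASH_REMATCH[0]}`) includes the neighbouring punctuation; for a yes/no test that makes no difference.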

Charles Duffy
  • That's why I detailed my problem, I assumed there was another way. But `[:space:]` is a little too restrictive, you should say `[^a-z0-9]` instead. <- this was written before your edit – Sheraff Dec 15 '14 at 02:28
  • Using `(^|[^a-zA-Z0-9_])` and `($|[^a-zA-Z0-9_])` absorbs the character. It's probably fine here, but e.g. non-GNU sed can't equate `echo foo foo foo foo |sed 's/\bfoo\b/bar/g'` with `echo foo foo foo foo |sed -E 's/(^|[^[:alnum:]_])foo($|[^[:alnum:]_])/\1bar\2/g'` because there's only one space between those `foo`s; you'll end up with `bar foo bar foo`. Update: Charles Duffy has [a solution to this issue](https://stackoverflow.com/a/22261454/519360) using GNU regex, where `\b` is available. – Adam Katz Apr 20 '22 at 03:02
2

The accepted answer may be erroneous on two minor points:

  • As far as I can make out, \b and \< / \> (word boundary matching) are not a PCRE innovation. Then again, I am unable to trace the introduction of word boundary matching in RE engines, so it may well have been Perl.
  • As correctly stated in the answer, POSIX EREs don't support word boundary matching. However, all modern regular expression engines do provide word boundary matching as part of basic REs, not just ERE: you just need to find the syntax.

That said, this answer is very specific to Linux builds of Bash (with a final MacOSX specific section, which may apply to all BSD derivatives as well).

  • By definition, GNU Regular Expressions (RE) support both \b and \< / \> as word boundaries (grep syntax). This is not a Perl Compatible Regular Expression extension, AFAIK. [1]

  • Bash has supported GNU Extended RE (grep -E syntax) since 3.0. [2]

  • Thus for all versions of Bash >= 3.0, [[ " h " =~ '\bh\b' ]] && echo yes || echo no should give me yes. It does not (see the next points).

  • In Bash versions 3.0 through 3.1, [[ " h " =~ '\bh\b' ]] && echo yes || echo no will give me yes. Notice that the pattern itself is the right hand side (RHS) argument of the =~ operator. [2]

  • Bash-3.2 changed quoting rules for the match operator =~. [2]

  • Since Bash-3.2, the pattern should ideally be stored in a variable and the variable should be supplied as the RHS argument to the =~ operator: pat='\bh\b' ; [[ " h " =~ $pat ]] && echo yes || echo no. The reason is that the quoting rules changed, so that if the pattern is supplied inside quotes ('' or ""), the pattern is interpreted as a literal string instead of a regex. [2]
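
The quoting change described in the points above can be seen with a small sketch. It assumes Bash 3.2 or later without `shopt -s compat31`, and deliberately uses the portable pattern `h.llo` rather than `\b`, so the result doesn't depend on the local libc:

```bash
#!/usr/bin/env bash
pat='h.llo'

# Unquoted expansion: the value is treated as a regex, so "." matches "e".
[[ 'hello' =~ $pat ]] && echo "unquoted: regex match"

# Quoted expansion: the value is matched as a literal string.
[[ 'hello' =~ "$pat" ]] || echo "quoted: no match against hello"
[[ 'h.llo' =~ "$pat" ]] && echo "quoted: matches the literal text h.llo"
```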

Finally, your pattern is correct, it's just a weird quoting issue:

[samveen@ankhmorpork ~]# echo $BASH_VERSION
4.2.46(1)-release
[samveen@ankhmorpork ~]# re='^.*(\b((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))\b).*$'
[samveen@ankhmorpork ~]# for i in 720p 1080p brrip; do
>      [[ $i =~ $re ]] && echo yes for $i || echo no for $i
> done
yes for 720p
yes for 1080p
yes for brrip

Further, for Bash on MacOSX, the boundary match changes from \b to [[:<:]] (start of word) and [[:>:]] (end of word) [3]:

SamveensMBP:~ samveen$ echo $BASH_VERSION
3.2.57(1)-release
SamveensMBP:~ samveen$ re='^.*([[:<:]]((720p)|(1080p)|(((br)|(hd)|(bd)|(web)|(dvd))rip)|((x|h)264)|(DVDscr)|(xvid)|(hdtv)|(ac3)|(s[0-9]{2}e[0-9]{2})|(avi)|(mp4)|(mkv)|(eztv)|(YIFY))[[:>:]]).*$'
SamveensMBP:~ samveen$ for i in 720p 1080p brrip; do
>     [[ $i =~ $re ]] && echo yes for $i || echo no for $i
> done
yes for 720p
yes for 1080p
yes for brrip
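
If a script has to run on both kinds of system, one possible approach is to probe at runtime for whichever syntax the local regex library accepts. This is only a sketch: `pick_boundaries`, `wb_start`, and `wb_end` are names made up for illustration, and the final fallback is the plain-ERE emulation, which consumes the neighbouring character.

```bash
#!/usr/bin/env bash
# Sets wb_start / wb_end to a word-boundary syntax that works here.
pick_boundaries() {
  local probe

  # GNU-style \b: must match " h " and must not match "xhx".
  probe='\bh\b'
  if [[ ' h ' =~ $probe ]] && ! [[ 'xhx' =~ $probe ]]; then
    wb_start='\b' wb_end='\b'
    return
  fi

  # BSD/macOS-style [[:<:]] and [[:>:]].
  probe='[[:<:]]h[[:>:]]'
  if [[ ' h ' =~ $probe ]] 2>/dev/null; then
    wb_start='[[:<:]]' wb_end='[[:>:]]'
    return
  fi

  # Plain ERE fallback; these groups match (and consume) a delimiter.
  wb_start='(^|[^[:alnum:]_])' wb_end='($|[^[:alnum:]_])'
}

pick_boundaries
re="${wb_start}(720p|1080p|brrip)${wb_end}"
for i in 720p 1080p brrip x720px; do
  [[ $i =~ $re ]] && echo "yes for $i" || echo "no for $i"
done
```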

References:

[1] GNU grep Manual: Regex section

[2] The Bash FAQ, by its author

[3] MacOSX manpage for re_format

Samveen
  • If you want to speak of what POSIX ERE -- the standard, not the GNU implementation of that standard -- supports, you'd do better to link to the standard itself. – Charles Duffy Jul 13 '16 at 14:51
  • That standard, in its entirety, lives at http://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd_chap09.html -- and there is in fact no `\b` specified. That the GNU implementation of ERE incorporates extensions is unsurprising to anyone, but they're *extensions*, and won't be available (for tools using the C-library matching functions, as bash does) on operating systems not using glibc. (Correspondingly, the GNU grep manual is hardly apropos -- whether glibc and GNU grep honor identical syntax is implementation-dependent). – Charles Duffy Jul 13 '16 at 14:52
  • ...this answer can easily be corrected by amending the wording to speak specifically of the GNU implementation of ERE, but as it stands there's a strong argument to be made that it is not in fact accurate. As for my own answer, I've amended it to refer *specifically* to the POSIX ERE standard, as opposed to any particular vendor's implementation of ERE; hopefully that'll suffice for you to retract your claims of inaccuracy, and the presumptively associated downvote. – Charles Duffy Jul 13 '16 at 14:59
  • Which is to say: The claim made in your answer that bash will always support "GNU ERE" is flat wrong, **and the FAQ you linked to makes no such claim**. To quote: `The '[[' command can now perform extended regular expression (egrep-like) matching` -- there's no guarantee present that it'll be GNU-extended syntax. – Charles Duffy Jul 13 '16 at 15:05
  • ...btw, one such platform is readily available: MacOS. If you want to demonstrate my claims for yourself, run `pat='\bh\b' ; [[ " h " =~ $pat ]] && echo yes || echo no` on identical builds of bash on Linux and a Mac next to each other. – Charles Duffy Jul 13 '16 at 15:10
  • @CharlesDuffy: On a Mac, please try, the following `man 7 re_format| grep '\\b'` and let me know what you find. – Samveen Jul 13 '16 at 17:38
  • Observe, from the text of that same man page: "Like the enhanced regex implementations in scripting languages such as perl(1) and python(1), these additional features may conflict with the IEEE Std 1003.2 (``POSIX.2'') standards in some ways." – Charles Duffy Jul 13 '16 at 17:40
  • More to the point, those behaviors are enabled when the `REG_ENHANCED` flag is passed -- **not** `REG_EXTENDED` -- which is why bash (which passes only `REG_EXTENDED`) doesn't support them on MacOS X. – Charles Duffy Jul 13 '16 at 17:41
  • ...thus, it's worth looking at context for a man page snippet, rather than grepping out a single line. – Charles Duffy Jul 13 '16 at 17:42
  • [...incidentally, I must say that I rather prefer Apple's decision to make their extensions require explicit action to enable over GNU's on-by-default behavior; rather less confusion that way]. – Charles Duffy Jul 13 '16 at 17:45
  • @CharlesDuffy Please try this as well: `pat='[[:<:]]h[[:>:]]'; [[ " h " =~ $pat ]] && echo yes || echo no`. The `REG_ENHANCED` isn't needed for word boundary matching on MacOS, just for the `\b`. – Samveen Jul 13 '16 at 18:31
  • That gets you MacOS, but that gives us only two platforms (and even that's if you assume Linux-on-dietlibc and Linux-on-musl-libc to be distinct platforms from Linux-on-glibc). If you want to speak to bash-in-general, as opposed to bash-running-on-two-specific-`libc`-implementations, the only reasonable thing one can do is speak to the letter of the POSIX standard. – Charles Duffy Jul 13 '16 at 19:25
  • That said, as edited, this is in a much better place than it was before; I can no longer justify a downvote on grounds of inaccuracy, though referring to bash as promising "GNU" ERE support is still a bit iffy. – Charles Duffy Jul 13 '16 at 19:27
  • ...so, the "your pattern is correct, it's just a weird quoting issue" isn't something I'm at all sure is actually established here. The OP tells us in the question that they're using `[[ $name =~ $re ]]`, so syntax that works with modern bash, and they accepted an answer which didn't suggest a change to their quoting but *did* suggest a change to their syntax; together, these strongly imply that they were running on a non-GNU platform. If you were to revise your answer to focus on "if you're on MacOS, use `[[:<:]]` and `[[:>:]]` instead of `\b`", that would be a revision I could fully support. – Charles Duffy Jul 13 '16 at 19:32
  • Quick point of note -- it's the "GNU" in "GNU ERE" that I was objecting to, as opposed to the "E" in "ERE". I fully agree with the claim that the initial implementation of `=~` in bash supported ERE. – Charles Duffy Jul 13 '16 at 19:37
  • (Also, I'm not claiming that word-boundary matching originated in PCRE, but that the use of `\b` as syntax for that behavior did). – Charles Duffy Jul 13 '16 at 19:51