#!/bin/bash
i="a001"
if ! [[ $i =~ "a[0-9]{3}"  ]]; then
    echo "success"
fi

With `i="a001"` the script prints "success", even though the value does match the pattern `a[0-9]{3}`, so the body of the `if` statement should not be executed. What is going on? In my opinion it is a compiler mistake. This is part of a bigger problem that I have not been able to solve since yesterday.

pfnuesel
user3162968
  • `bash` is not a compiler - it is an interpreter – Ed Heal Apr 20 '16 at 18:43
  • okay, but what is going on with my regular expression? no such problems in other languages like PHP – user3162968 Apr 20 '16 at 18:44
  • This `if !` looks an awful lot like _IF NOT_ to me, echoing 'success' only on a failure? –  Apr 20 '16 at 18:46
  • I tried to write `if not` but it did not compile, now it seems to work with solution posted by Ignacio Vazq... – user3162968 Apr 20 '16 at 18:49
  • @user3162968, I think sln was asking if your logic was a true representation of your intent, as opposed to saying that the `!` was invalid or incorrect syntax (which, indeed, it's not). It's unusual for a *failure* to match to be the *success* case, because it implies that you have full knowledge of all possible error conditions, which isn't a typical scenario. – Charles Duffy Apr 21 '16 at 14:48

2 Answers


For consistent behavior across all bash versions having an =~ operator in [[ ]], put your regex in a variable and use the variable unquoted on the right-hand side of that operator:

i="a001"
re="a[0-9]{3}"
if ! [[ $i =~ $re ]]; then
    echo "success"
fi
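
A minimal sketch of the same idea wrapped in a function. The helper name `matches_re` and the `^`/`$` anchors are illustrative assumptions and not part of the answer above. Because the regex arrives in a variable and is expanded unquoted, patterns containing spaces need no extra escaping inside `[[ ]]`:

```bash
#!/bin/bash
# Hypothetical helper (name chosen for illustration): matches_re VALUE REGEX
# Returns 0 if VALUE matches the POSIX ERE in REGEX, non-zero otherwise.
matches_re() {
    local value=$1 re=$2
    # $re is deliberately unquoted so it is treated as a regex, not a literal.
    [[ $value =~ $re ]]
}

i="a001"
re='^a[0-9]{3}$'   # anchors added here as an assumption, to match the whole string

if matches_re "$i" "$re"; then
    result="match"
else
    result="no match"
fi
echo "$result"
```

Storing the pattern in a variable also makes it easy to print it (for example with `printf '%s\n' "$re"`) and see exactly what the regex engine will receive.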
Charles Duffy

Quotes escape the metacharacters in the regex, and so shouldn't be included here.

$ i="a001"
$ [[ $i =~ "a[0-9]{3}"  ]] ; echo $?
1
$ [[ $i =~ a[0-9]{3}  ]] ; echo $?
0
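
To make the effect of the quotes more concrete, here is a small sketch (assuming bash 3.2 or later with default compatibility settings; the variable names are illustrative). A quoted right-hand side is matched as a literal string, so it only matches input that actually contains the characters `a[0-9]{3}`:

```bash
#!/bin/bash
i="a001"

# Quoted: the pattern is treated as the literal text a[0-9]{3}
[[ $i =~ "a[0-9]{3}" ]]; quoted_rc=$?

# Unquoted: the pattern is treated as a regular expression
[[ $i =~ a[0-9]{3} ]]; unquoted_rc=$?

# A string that really contains the literal characters a[0-9]{3}
literal_text='xa[0-9]{3}y'
[[ $literal_text =~ "a[0-9]{3}" ]]; literal_rc=$?

echo "$quoted_rc $unquoted_rc $literal_rc"   # prints: 1 0 0
```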
Ignacio Vazquez-Abrams
  • I think this is correct but a bit misleading due to how it's worded. "Does not require quotes" implies that quotes are redundant, meaning it's okay to quote, just so long as you don't mind some extra typing. As your code shows, that's incorrect, the quotes are wrong. Your answer doesn't yet really explain why. –  Apr 20 '16 at 18:51
  • Wow, I'd like to see how they parse those delimiters around the arguments especially since regex is so metacharacterized. –  Apr 20 '16 at 18:52
  • it seems to work, thank you, i could not solve this exercise since yesterday – user3162968 Apr 20 '16 at 18:53
  • Case in point: I'm looking for a pattern that matches '`"a001"`' and the regex I'm using is `"a[0-9]{3}"` –  Apr 20 '16 at 18:55
  • What about spaces. I want to match ` ` what do I put in ` ` ? How do I match `]]` ? How many more conflicts with whatever the parse uses for delimiter concept ? –  Apr 20 '16 at 18:57
  • @IgnacioVazquez-Abrams: if the quotes were part of the pattern then `i='"a001"'` should match, but it does not (try it). You have to escape the quotes to get them to match. – cdarke Apr 20 '16 at 19:16
  • No, that wasn't what I meant. The quotes are not part of the pattern. However, the characters inside the quotes are remembered as being quoted, which disables any special meaning they might otherwise have. @sln To match a literal space, you need to quote/escape it. But that doesn't mean you need to quote/escape the whole pattern. `[[ "a b" =~ ^." ".$ ]]; echo $?` or `[[ "a b" =~ ^.\ .$ ]]; echo $?` It should be clear how this extends to other special characters or character combinations such as `]]`. –  Apr 20 '16 at 19:16
  • @hvd IgnacioVazquez-Abrams said "the quotes are always part of the pattern", and I was about to comment what you have just said. – cdarke Apr 20 '16 at 19:19
  • Leading or trailing space (which was asked) can be matched by preceding with a `\\`. – cdarke Apr 20 '16 at 19:20
  • @sln You put the regex into a separate variable, see other answer. – Benjamin W. Apr 20 '16 at 20:09
  • @hvd - Oh yea, that's getting interesting. Now a quote is not a literal quote. Let me get this straight now, anything between quotes is a literal, not a regex. They have circumvented the regex parser and are now interpolating strings in the middle of the regex string that is not delimited with any kind of meaning. So =~ `"^\\\\\11$"` will literally match `^\\\\\11$`. I'm getting the feeling they tried to shoe horn regex into their interpreter and didn't think it through. –  Apr 20 '16 at 20:10
  • @sln The regular rules for single and double quoted strings apply, which say that inside a double-quoted string, a backslash is special, and a double backslash is an escaped literal backslash. Which is why `[[ '^\\\\\11$' =~ "^\\\\\11$" ]]` returns `1`, not `0`. The rules are pretty much the same as for the more limited pattern matching inside `case` statements, which were well-established by the time bash added regexes. –  Apr 20 '16 at 20:13
  • @hvd - I don't care when they did it, they did it wrong. A string inside an expression, as the parser sees it, should be accompanied by an operator. Utterly absurd. To not have a real delimiter is a conceptual error that, frankly, I don't think they care a lot about anyway. Too much command line mentality, but totally void of regex knowledge. I'd just pass in the ignore whitespace modifier, but they'd just ignore it. –  Apr 20 '16 at 20:17
  • @sln, this syntax gives you things you *can't do* inside languages supporting only traditional regex syntax, mixing auto-escaped literals and regex components inside a single match. Consider: `[[ $str =~ (^|foo)"$literal"(bar|$) ]]`; even if `$literal` would otherwise be a regex, because it's quoted, the contents of the variable are auto-escaped to only match the literal characters contained. It's a feature, not a bug. – Charles Duffy Apr 21 '16 at 14:45
  • @sln, ...as for your previous questions (about spaces, `]]`, etc), those are all addressed by the practice given in my answer (putting the regex in a variable rather than passing it as a literal). If one wants to do that *in conjunction* with what I described above, then `[[ $str =~ ${re_prefix}"$literal"${re_suffix} ]]` will have that effect. – Charles Duffy Apr 21 '16 at 14:50
  • @CharlesDuffy - That's cool and everything, but in order to use the spiffy _features_, I'd have to sit down and analyze my really big and diverse regex and figure out what, and where, I would have to chop up and triple-escape in order to get my regex to work in an initial environment of delimiters, `[[`, `]]`, and balanced double quotes `"`. Just imagine trying to _debug_ that regex catastrophe! Btw, in no language I know is a space allowed as a regex delimiter. And regular expressions are a language, not a string. –  Apr 21 '16 at 18:27
  • @sln, spaces aren't a "regex delimiter" in bash either, because bash has no syntax-level regex primitive. Instead, just like in Python, regexes can be specified *by passing strings around*; and bash uses spaces to delimit unquoted strings. – Charles Duffy Apr 21 '16 at 19:36
  • @sln, ...that strings carry with them per-character quoting data (albeit only accessible to language syntax such as `[[ ]]` and not regular builtins, functions, &c) certainly makes bash an oddity among languages, but... well, if it weren't enough of an oddity to be interesting, I'd be following a language tag in the LISP family and helping folks work with a language that's actually well-designed, instead of over here. :) – Charles Duffy Apr 21 '16 at 19:43
  • @sln, ...and re: "regular expressions are a language, not a string", you do realize that a string can be parsed as content in a language, no? What's the point of the hair you're splitting? Designing boundaries between an outer language and an inner DSL (as regexes are) in such a way as to avoid needing to complicate the outer language's rules by adding exceptions to it strikes me as a feature, not a bug -- and is why I *strongly* prefer the approach to regexes taken by Python, C or Java (aka "just library functionality") to that taken by Ruby, Perl, etc. – Charles Duffy Apr 21 '16 at 19:45
  • @CharlesDuffy - It's entirely unclear how escaping works in this context. There is a before and after effect. If it's a strict un-escape of each escape, then all meta and special regex meaning need to be escaped, i.e. `\w` to the engine, must now be `\\w` in bash. However, this means `\]]` to the engine, must now be `\\]]` to bash. But under the un-escape rules, bash sees `\\ ` a double escape, then `]]` a regex delimiter. In regex land, the `]` is always the end of a class that needs to be escaped if literal. Even if the rules are different, I can give an example of a failure / conceptual error. –  Apr 22 '16 at 19:09
  • @sln, using the two-stage approach, the assignment follows regular string-escaping rules, and the compilation as a regular expression follows standard regex rules -- having normal language parsing rules apply on the outside being one of the advantages of the regexes-as-library-functionality-only approach in general. This also means that someone not familiar with bash rules can check w/ `printf '%s\n' "$re"` or such between the stages to confirm contents, with assurance that the output is precisely what the standard `regcomp()` call will be processing. – Charles Duffy Apr 22 '16 at 19:46
  • @sln, ...so: `re=']]'; [[ $str =~ $re ]]` -- nothing ambiguous whatsoever. (You mention `\w`, by the way, but this is POSIX ERE, not PCRE, so that's not a primitive that exists here). – Charles Duffy Apr 22 '16 at 20:19
  • @CharlesDuffy - Sure, there is a way to pre-assign a regex string to a variable, but my point is that `[[ .. ]]` is an especially bad language delimiter construct to parse out a stringed regex. There is apparently _nothing_ in the escaping rules for that construct, to separate `\\]]` into `\]]`, language to regex. It's not uncommon either. _Any_ language level that parses escapes will not be able to parse an escaped delimiter. And this is ok, except that regex engines use metacharacters, like `[]` that have to be escaped if a literal. Perl, has the same problem as do others. –  Apr 22 '16 at 20:22
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109950/discussion-between-charles-duffy-and-sln). – Charles Duffy Apr 22 '16 at 20:38
  • @CharlesDuffy - I'm just using the term language as a euphemism for parsing, whether command line or anything else. I don't have bash, but if I did, the first thing I'd try is given `i="]"`. running `[[ $i =~ [\]]]]` then `[[ $i =~ [\\]]]]` then `[[ $i =~ [\\]] ]]` to know how they parse it. My premise is correct though, and it's not an argument question. These are facts. –  Apr 22 '16 at 20:38
  • Eh? Neither of `[[ $i =~ [\]]]]` or `[[ $i =~ [\\]]]]` are valid syntax; `]]` must be its own token, so in both of the above, `[[` is still the active context, no matter how many `]`s may be included in the string which you intend to parse as a regex. The tokenization rules know nothing about regular expressions, **nor should they**. – Charles Duffy Apr 22 '16 at 20:56
  • ...to answer your question: `i=']'; [[ $i =~ [\]] ]]` parses the same as `i=']'; [[ $i =~ []] ]]`; both return true. – Charles Duffy Apr 22 '16 at 21:01
  • @sln, ...btw, re: your sentence, "There is apparently nothing in the escaping rules for that construct, to separate \\]] into \]], language to regex.", I simply do not understand its meaning; it doesn't parse to me as a meaningful English sentence. What do you mean by "nothing in the escaping rules for that construct"? Do you mean to say that the language's usual escaping rules don't apply? This would be incorrect, if so. What do you mean by "language to regex"? Do you mean that the boundary is poorly-defined? This would be incorrect, if so. – Charles Duffy Apr 22 '16 at 21:05
  • I suspect that I may understand the cause of our miscommunication here: Do you expect `[[ $foo =~ bar baz ]]` to be valid? It is not: The right-hand side of `=~` must be exactly one word (in the formal sense defined by the bash parser). It is thus not `]]` that terminates a (string to be later interpreted as a) regex at all; this is merely a human-readability construct, and `[[` could still exist without it, just as `test` exists as a synonym to `[` that doesn't require `]` at the end. – Charles Duffy Apr 22 '16 at 21:11
  • I think I got ya. This construct `[[ $i =~ $re ]] ` are three inner tokens separated by a space. Simple.. –  Apr 22 '16 at 23:02
  • Entirely correct -- and that's also true for `[[ $i =~ foo\ bar ]]`, in which case that last inner token is `foo bar`, parsed as a single word. Compare to `[[ $i =~ foo bar ]]`, in which case the word used as a regex is `foo`, and there's an extra, unexpected word `bar` on the end making the test syntax invalid. – Charles Duffy Apr 22 '16 at 23:48
  • @CharlesDuffy - Yeah, don't know what I was thinking, thanks. –  Apr 25 '16 at 22:08
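
The points made in the comments about mixing quoted literals with regex fragments, and about a backslash-escaped space staying part of one word, can be tried out with a short sketch. The sample values and variable names below are assumptions chosen for illustration, and the behavior described assumes bash 3.2 or later:

```bash
#!/bin/bash
literal='a.b'      # hypothetical value containing the regex metacharacter "."
re_prefix='^x'     # regex fragment, used unquoted
re_suffix='y$'     # regex fragment, used unquoted

# Quoted "$literal": the dot only matches a literal dot.
[[ 'xa.by' =~ ${re_prefix}"$literal"${re_suffix} ]]; exact_rc=$?
[[ 'xaZby' =~ ${re_prefix}"$literal"${re_suffix} ]]; quoted_rc=$?

# Unquoted $literal: the dot is a regex wildcard again.
[[ 'xaZby' =~ ${re_prefix}${literal}${re_suffix} ]]; unquoted_rc=$?

# An escaped space keeps the regex as a single word on the right-hand side.
[[ 'foo bar' =~ ^foo\ bar$ ]]; space_rc=$?

echo "$exact_rc $quoted_rc $unquoted_rc $space_rc"   # prints: 0 1 0 0
```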