Bash Regular Expression -- Can't seem to match any of \s \S \d \D \w \W etc

Question

I have a script that is trying to get blocks of information from gparted.

My data looks like:

Disk /dev/sda: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system     Flags
 1      1049kB  316MB   315MB   primary  ext4            boot
 2      316MB   38.7GB  38.4GB  primary  ext4
 3      38.7GB  42.9GB  4228MB  primary  linux-swap(v1)

log4net.xml
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system     Flags
 1      1049kB  316MB   315MB   primary  ext4            boot
 5      316MB   38.7GB  38.4GB  primary  ext4
 6      38.7GB  42.9GB  4228MB  primary  linux-swap(v1)

I use a regex to break this into two Disk blocks:

^Disk (/dev[\S]+):((?!Disk)[\s\S])*

This works with multiline on.

When I test this in a bash script, I can't seem to match \s, or \S -- What am I doing wrong?

I am testing this through a script like:

data=`cat disks.txt`
morematches=1
x=0
regex="^Disk (/dev[\S]+):((?!Disk)[\s\S])*"

if [[ $data =~ $regex ]]; then
echo "Matched"
while [ $morematches == 1 ]
do
        x=$[x+1]
        if [[ ${BASH_REMATCH[x]} != "" ]]; then
                echo $x "matched" ${BASH_REMATCH[x]}
        else
                echo $x "Did not match"
                morematches=0;
        fi

done

fi

However, when I walk through testing parts of the regex, whenever I try to match a \s or \S, it doesn't work. What am I doing wrong?

Apparently so.. I guess every other regex engine I've used has been using the perl conventions — Yablargo, Aug 29 '13 at 14:53
`\s` and `\S` are PCRE extensions; they are not present in the ERE (Posix Extended Regular Expression) standard. Just be glad you aren't trying to use BRE. — Charles Duffy, Aug 29 '13 at 14:53
...by the way, a lot of the PCRE extensions are poorly-thought-out things with absolutely horrid worst-case performance (particularly, lookahead/lookbehind). Choosing to use ERE instead is, as a rule, very much defensible. — Charles Duffy, Aug 29 '13 at 14:54
...see in particular http://swtch.com/~rsc/regexp/regexp1.html — Charles Duffy, Aug 29 '13 at 14:56
...kibitzing on some other points: `x=$[x+1]` is an antique syntax; `((x++))` is the modern bash version, or `x=$((x + 1))` the modern POSIX version. Using `==` inside of `[ ]` is not POSIX-compliant; either use `[[ ]]` (which doesn't try to be POSIX compliant, and allows you to not quote by virtue of having parse-time rules that turn off string-splitting) or use `=` instead of `==` (and make it `[ "$morematches" = 1 ]`, WITH THE QUOTES!). Always quote your expansions: `echo "$x did not match"`; otherwise, globs inside of `$x` are expanded and runs of whitespace compressed. — Charles Duffy, Aug 29 '13 at 14:59
@Yablargo It's unclear what your script is actually meant to do. Do you want to have a message like `/dev/xyz matched 4.9GB`? — konsolebox, Aug 29 '13 at 15:03
Konsole: This was just to test the regex. I have a larger, irrelevant script that does something with /dev/sda1, /dev/sda2, etc. based on its file system type — Yablargo, Aug 29 '13 at 16:34
Duffy: Good to know! I don't usually do much bash shell scripting — Yablargo, Aug 29 '13 at 16:34
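
Charles Duffy's syntax notes above could be applied to the test loop roughly as follows. This is only a sketch: the sample line and the ERE-only pattern are assumptions chosen for illustration, not the asker's final script.

```shell
#!/bin/bash
# Assumed sample input: one header line from the parted output.
line='Disk /dev/sda: 42.9GB'

# ERE-only pattern, stored in a variable and expanded unquoted after =~.
regex='^Disk (/dev[^[:space:]]+): ([^[:space:]]+)'

if [[ $line =~ $regex ]]; then
    echo "Matched"
    # BASH_REMATCH[0] is the whole match; capture groups start at index 1.
    for ((x = 1; x < ${#BASH_REMATCH[@]}; x++)); do
        echo "$x matched ${BASH_REMATCH[x]}"
    done
fi
```

Looping up to `${#BASH_REMATCH[@]}` avoids the `morematches` flag and stops cleanly even when a group legitimately captures an empty string.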

konsolebox · Accepted Answer · 2021-03-30T15:07:28.720

28

Perhaps \S and \s are not supported, or you cannot place them inside [ ]. Try the following regex instead:

^Disk[[:space:]]+/dev[^[:space:]]+:[[:space:]]+[^[:space:]]+

EDIT

It seems like you actually want to get the matching fields. I simplified the script to this for that.

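#!/bin/bash

# Group 1 captures the device path, group 2 the rest of the line.
regex='^Disk[[:space:]]+(/dev[^[:space:]]+):[[:space:]]+(.*)'

# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r line; do
    [[ $line =~ $regex ]] && echo "${BASH_REMATCH[1]} matches ${BASH_REMATCH[2]}."
done < disks.txt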

Produces:

/dev/sda matches 42.9GB.
/dev/sdb matches 42.9GB.

edited Mar 30 '21 at 15:07

answered Aug 29 '13 at 14:49

konsolebox

`[[:alnum:]]` and `[[:digit:]]` would probably be better than the "^space" constructs (even though those match what the OP asked for). – Mat Aug 29 '13 at 14:57
@Mat Yes it could be an option too :) – konsolebox Aug 29 '13 at 15:10
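
Mat's suggestion could look something like the sketch below. The split of the size into a number and a unit, and the variable names `dev`, `size` and `unit`, are assumptions added for illustration.

```shell
#!/bin/bash
# Assumed sample input: one header line from the parted output.
line='Disk /dev/sda: 42.9GB'

# Device name as alphanumerics, size as digits and dots, unit as letters.
regex='^Disk[[:space:]]+(/dev/[[:alnum:]]+):[[:space:]]+([[:digit:].]+)([[:alpha:]]+)'

if [[ $line =~ $regex ]]; then
    dev=${BASH_REMATCH[1]}
    size=${BASH_REMATCH[2]}
    unit=${BASH_REMATCH[3]}
    echo "dev=$dev size=$size unit=$unit"
fi
```

The narrower classes reject lines that merely happen to contain non-space text after `Disk`, at the cost of assuming device names are purely alphanumeric.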

tripleee · Answer 2 · 2022-05-23T05:29:45.220

Because this is a common FAQ, let me list a few constructs which are not supported in Bash (and related tools like sed, grep, etc), and how to work around them, where there is a simple workaround.

There are multiple dialects of regular expressions in common use. The one supported by Bash is a variant of Extended Regular Expressions. This is different from e.g. what many online regex testers support, which is often the more modern Perl 5 / PCRE variant.

Bash doesn't support \d \D \s \S \w \W -- these can be replaced with POSIX character class equivalents [[:digit:]], [^[:digit:]], [[:space:]], [^[:space:]], [_[:alnum:]], and [^_[:alnum:]], respectively. (Notice the last case, where the [:alnum:] POSIX character class is augmented with underscore to be exactly equivalent to the Perl \w shorthand.)
Bash doesn't support non-greedy matching. You can sometimes replace a.*?b with something like a[^ab]*b to get a similar effect in practice, though the two are not exactly equivalent.
Bash doesn't support non-capturing parentheses (?:...). In the trivial case, just use capturing parentheses (...) instead; though of course, if you use capture groups and/or backreferences, this will renumber your capture groups.
Bash doesn't support lookarounds like (?<=before) or (?!after) and in fact anything with (? is a Perl extension. There is no simple general workaround for these, though you can often rephrase your problem into one where lookarounds can be avoided.
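
As a rough illustration of the first two points in the list above, the following sketch builds a pattern from POSIX classes and compares a greedy pattern with a bounded one. The variable names and sample strings are made up for the example.

```shell
#!/bin/bash
# POSIX bracket expressions standing in for the Perl shorthands.
digit='[[:digit:]]'     # \d
space='[[:space:]]'     # \s
word='[_[:alnum:]]'     # \w

# Roughly the ERE counterpart of the Perl pattern ^(\w+)\s+(\d+)$
re="^(${word}+)${space}+(${digit}+)\$"
if [[ 'disk_1   42' =~ $re ]]; then
    name=${BASH_REMATCH[1]}
    value=${BASH_REMATCH[2]}
    echo "name=$name value=$value"
fi

# No non-greedy quantifier: exclude the delimiter instead of writing a.*?b
greedy='a.*b'
bounded='a[^ab]*b'
[[ 'a1b2b' =~ $greedy ]] && greedy_match=${BASH_REMATCH[0]}
[[ 'a1b2b' =~ $bounded ]] && bounded_match=${BASH_REMATCH[0]}
echo "greedy=$greedy_match bounded=$bounded_match"
```

Keeping the classes in variables is just a readability choice; writing them inline in a single-quoted pattern works the same way.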

https://stackoverflow.com/questions/19453991/is-it-possible-to-perform-look-behind-and-look-ahead-assertions-in-grep-without has some ideas for how to reimplement lookarounds. — tripleee, Apr 04 '19 at 04:59
Perhaps tangentially see also [Why are there so many different regular expression dialects?](https://stackoverflow.com/questions/2298007/why-are-there-so-many-different-regular-expression-dialects) — tripleee, Oct 04 '21 at 05:13
Bash does support `\s` and others in certain cases, see my answer below. — alecov, May 31 '22 at 18:52

Kent · Answer 3 · 2013-08-29T16:22:13.187

4

from man bash

An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).

ERE doesn't support look-ahead/behind. However you have them in your code ((?!Disk)).

That's why your regex won't match as you expected.
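
One way to sidestep the lookahead is to read the output line by line and treat each `Disk` header as the start of a new block. The sketch below assumes that approach; `list_partitions` is a hypothetical helper, and the here-document is an abbreviated copy of the data from the question.

```shell
#!/bin/bash
# list_partitions is a hypothetical helper: it reads parted-style output on
# stdin and prints one device-plus-partition-number name per partition row.
list_partitions() {
    local disk_re='^Disk[[:space:]]+(/dev[^[:space:]:]+):'
    local part_re='^[[:space:]]*([[:digit:]]+)[[:space:]]'
    local line dev=''
    while IFS= read -r line; do
        if [[ $line =~ $disk_re ]]; then
            # A new "Disk" header starts the next block.
            dev=${BASH_REMATCH[1]}
        elif [[ -n $dev && $line =~ $part_re ]]; then
            # Partition rows belong to the most recent header seen.
            printf '%s%s\n' "$dev" "${BASH_REMATCH[1]}"
        fi
    done
}

list_partitions <<'EOF'
Disk /dev/sda: 42.9GB
Number  Start   End     Size    Type     File system     Flags
 1      1049kB  316MB   315MB   primary  ext4            boot
 2      316MB   38.7GB  38.4GB  primary  ext4
Disk /dev/sdb: 42.9GB
 5      316MB   38.7GB  38.4GB  primary  ext4
EOF
```

To run it against the file from the question, the here-document could be replaced with `list_partitions < disks.txt`.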

edited Aug 29 '13 at 16:22

answered Aug 29 '13 at 14:53

Kent

That, plus the lack of `\s` and `\S`. – Adrian Frühwirth Aug 29 '13 at 14:54
@AdrianFrühwirth `\s and \S` should be ok, see my answer, I added that section. – Kent Aug 29 '13 at 15:01
`\s` and `\S` may work in practice, but the bash documentation does not promise that they'll work -- only the ERE syntax parsed by `regex(3)` is guaranteed to be supported, and the POSIX ERE standard does not include these shortcuts. Relying on them is thus... unfortunate and fragile. – Charles Duffy Aug 29 '13 at 15:01
Charles is right... on my system I get: `[[ "aaa" =~ "\S+" ]] && echo "yes" || echo "no"` --> `no` – dcsohl Aug 29 '13 at 16:10
@CharlesDuffy sorry, I was testing it in zsh...... you are right, I am removing the \s \S part. – Kent Aug 29 '13 at 16:21

score 2 · Answer 4 · answered May 31 '22 at 18:51

Bash supports whatever regcomp(3) supports on your system. Glibc's implementation does support \s and some other shorthands (such as \S, \w and \W) as an extension, but because of the way Bash handles quoting on the right-hand side of =~, you cannot write a working \s inline in the expression, no matter how you escape it:

[[ 'a   b' =~ a[[:space:]]+b ]] && echo ok # OK
[[ 'a   b' =~ a\s+b ]] || echo fail        # Fail
[[ 'a   b' =~ a\\s+b ]] || echo fail       # Fail
[[ 'a   b' =~ a\\\s+b ]] || echo fail      # Fail

It is much simpler to work with a pattern variable for this:

pattern='a\s+b'
[[ 'a   b' =~ $pattern ]] && echo ok # OK

This is then obviously only true on systems where Bash was compiled with Glibc. For me, it works out of the box on Ubuntu, but not on MacOS. — tripleee, Jun 01 '22 at 02:35
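
Given the portability caveat above, a script could probe for `\s` support and fall back to the POSIX class. This is a sketch only; the probe string and the variable names `ws`, `pattern` and `result` are assumptions, and in practice always using `[[:space:]]` is simpler.

```shell
#!/bin/bash
# Probe whether the local regcomp(3) understands \s when it arrives via a variable.
probe='^a\sb$'
if [[ 'a b' =~ $probe ]]; then
    ws='\s'             # extension is available on this system
else
    ws='[[:space:]]'    # POSIX class, part of standard ERE
fi

# Build the real pattern from whichever whitespace token was selected.
pattern="a${ws}+b"
[[ 'a   b' =~ $pattern ]] && result=ok || result=fail
echo "whitespace token: $ws -> $result"
```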

score 0 · Answer 5 · answered Aug 29 '13 at 15:02

Also, in Perl-style regex, [\s\S] matches any character, including a newline (like . with single-line mode enabled). On my shell, [^\s] works but not [\S].

answered Aug 29 '13 at 15:02

perreal

`[^\s]` doesn't do what you think; in a POSIX bracket expression the backslash is literal, so it just matches a single character which is neither a backslash nor `s` – tripleee Sep 18 '18 at 11:30

score 0 · Answer 6 · answered Aug 29 '13 at 19:38

I know you already "solved" this, but your original issue was probably as simple as not quoting $regex in your test. ie:

if [[ $data =~ "$regex" ]]; then

Bash variable expansion will simply plop in the string, and the space in your original regex will break test because:

regex="^Disk (/dev[\S]+):((?!Disk)[\s\S])*"
if [[ $data =~ $regex ]]; then

is the equivalent of:

if [[ $data =~ ^Disk (/dev[\S]+):((?!Disk)[\s\S])* ]]; then

and bash/test will have a fun time interpreting a bonus argument and all those unquoted meta-characters.

Remember, bash does not pass variables, it expands them.

answered Aug 29 '13 at 19:38

Sammitch

This was pretty confusing after my 20 minute crash course ;) I ended up just writing a small perl script that I invoke, and that was a lot simpler. I hadn't realized that the bash regex conventions were so different, as pretty much everything else I have used supports perl-style. – Yablargo Aug 29 '13 at 19:51
This answer isn't actually correct -- `[[` has its own parser-level handling; it treats the content on the right-hand side as a literal string if quoted, and a regex if unquoted; it _does not_ perform word-splitting or globbing. This means `regex='.+'; [[ $data =~ $regex ]]` matches any non-empty string, whereas `regex='.+'; [[ $data =~ "$regex" ]]` matches only strings that contain the exact text `.+` within them. – Charles Duffy Aug 12 '20 at 16:57
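
The difference Charles Duffy describes can be checked with a short sketch like this one; the sample strings and result variable names are chosen only for illustration.

```shell
#!/bin/bash
regex='.+'

# Unquoted expansion: the value is interpreted as a regular expression.
[[ 'abc' =~ $regex ]] && unquoted=match || unquoted='no match'

# Quoted expansion: the value is matched as the literal text ".+".
[[ 'abc' =~ "$regex" ]] && quoted=match || quoted='no match'
[[ 'a.+b' =~ "$regex" ]] && literal=match || literal='no match'

echo "unquoted=$unquoted quoted=$quoted literal=$literal"
```

This is why the usual advice is to keep the pattern in a variable and expand it without quotes on the right-hand side of `=~`.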
