
I have a script that is trying to get blocks of information from gparted.

My data looks like:

Disk /dev/sda: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system     Flags
 1      1049kB  316MB   315MB   primary  ext4            boot
 2      316MB   38.7GB  38.4GB  primary  ext4
 3      38.7GB  42.9GB  4228MB  primary  linux-swap(v1)

log4net.xml
Model: VMware Virtual disk (scsi)
Disk /dev/sdb: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type     File system     Flags
 1      1049kB  316MB   315MB   primary  ext4            boot
 5      316MB   38.7GB  38.4GB  primary  ext4
 6      38.7GB  42.9GB  4228MB  primary  linux-swap(v1)

I use a regex to break this into two Disk blocks:

^Disk (/dev[\S]+):((?!Disk)[\s\S])*

This works with multiline on.

When I test this in a bash script, I can't seem to match \s or \S. What am I doing wrong?

I am testing this through a script like:

data=`cat disks.txt`
morematches=1
x=0
regex="^Disk (/dev[\S]+):((?!Disk)[\s\S])*"

if [[ $data =~ $regex ]]; then
echo "Matched"
while [ $morematches == 1 ]
do
        x=$[x+1]
        if [[ ${BASH_REMATCH[x]} != "" ]]; then
                echo $x "matched" ${BASH_REMATCH[x]}
        else
                echo $x "Did not match"
                morematches=0;
        fi

done

fi

When I test parts of the regex one at a time, every part that contains \s or \S fails to match.

Benjamin W.
Yablargo
  • Apparently so. I guess every other regex engine I've used has followed the Perl conventions – Yablargo Aug 29 '13 at 14:53
  • `\s` and `\S` are PCRE extensions; they are not present in the ERE (Posix Extended Regular Expression) standard. Just be glad you aren't trying to use BRE. – Charles Duffy Aug 29 '13 at 14:53
  • ...by the way, a lot of the PCRE extensions are poorly-thought-out things with absolutely horrid worst-case performance (particularly, lookahead/lookbehind). Choosing to use ERE instead is, as a rule, very much defensible. – Charles Duffy Aug 29 '13 at 14:54
  • ...see in particular http://swtch.com/~rsc/regexp/regexp1.html – Charles Duffy Aug 29 '13 at 14:56
  • ...kibitzing on some other points: `x=$[x+1]` is an antique syntax; `((x++))` is the modern bash version, or `x=$((x + 1))` the modern POSIX version. Using `==` inside of `[ ]` is not POSIX-compliant; either use `[[ ]]` (which doesn't try to be POSIX compliant, and allows you to not quote by virtue of having parse-time rules that turn off string-splitting) or use `=` instead of `==` (and make it `[ "$morematches" = 1 ]`, WITH THE QUOTES!). Always quote your expansions: `echo "$x did not match"`; otherwise, globs inside of `$x` are expanded and runs of whitespace compressed. – Charles Duffy Aug 29 '13 at 14:59
  • @Yablargo It's unclear what your script is actually trying to do. Do you want a message like `/dev/xyz matched 4.9GB`? – konsolebox Aug 29 '13 at 15:03
  • Konsole: This was just to test the regex. I have a larger, irrelevant script that does something with /dev/sda1, /dev/sda2, etc. based on the file system type – Yablargo Aug 29 '13 at 16:34
  • Duffy: Good to know! I don't usually do much bash shell scripting – Yablargo Aug 29 '13 at 16:34
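
The points raised in the comments above can be combined into a small sketch of the test loop. It is only an illustration, not the asker's final script: the one-line `data` value is made-up sample input, and the regex is rewritten with POSIX character classes so that bash's ERE engine accepts it.

```bash
#!/bin/bash
# Hypothetical sample input: a single line of parted-style output.
data='Disk /dev/sda: 42.9GB'

# ERE with POSIX classes, kept in a variable and used unquoted in [[ ]].
regex='^Disk (/dev[^[:space:]]+): ([^[:space:]]+)'

matches=()
if [[ $data =~ $regex ]]; then
    echo "Matched"
    # BASH_REMATCH[0] is the whole match; capture groups start at index 1.
    for ((x = 1; x < ${#BASH_REMATCH[@]}; x++)); do
        matches+=("${BASH_REMATCH[x]}")
        echo "$x matched ${BASH_REMATCH[x]}"
    done
fi
```

Looping up to `${#BASH_REMATCH[@]}` avoids the sentinel variable and also handles capture groups that legitimately matched an empty string.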

6 Answers


Perhaps \S and \s are not supported, or you cannot use them inside [ ]. Try the following regex instead:

^Disk[[:space:]]+/dev[^[:space:]]+:[[:space:]]+[^[:space:]]+

EDIT

It seems like you actually want to get the matching fields. I simplified the script to this for that.

#!/bin/bash 

regex='^Disk[[:space:]]+(/dev[^[:space:]]+):[[:space:]]+(.*)'

while IFS= read -r line; do
    [[ $line =~ $regex ]] && echo "${BASH_REMATCH[1]} matches ${BASH_REMATCH[2]}."
done < disks.txt

Produces:

/dev/sda matches 42.9GB.
/dev/sdb matches 42.9GB.
konsolebox

Because this is a common FAQ, let me list a few constructs which are not supported in Bash (and related tools like sed, grep, etc), and how to work around them, where there is a simple workaround.

There are multiple dialects of regular expressions in common use. The one supported by Bash is a variant of Extended Regular Expressions. This is different from e.g. what many online regex testers support, which is often the more modern Perl 5 / PCRE variant.

  • Bash doesn't support \d \D \s \S \w \W -- these can be replaced with POSIX character class equivalents [[:digit:]], [^[:digit:]], [[:space:]], [^[:space:]], [_[:alnum:]], and [^_[:alnum:]], respectively. (Notice the last case, where the [:alnum:] POSIX character class is augmented with underscore to be exactly equivalent to the Perl \w shorthand.)
  • Bash doesn't support non-greedy matching. You can sometimes replace a.*?b with something like a[^ab]*b to get a similar effect in practice, though the two are not exactly equivalent.
  • Bash doesn't support non-capturing parentheses (?:...). In the trivial case, just use capturing parentheses (...) instead; though of course, if you use capture groups and/or backreferences, this will renumber your capture groups.
  • Bash doesn't support lookarounds like (?<=before) or (?!after) and in fact anything with (? is a Perl extension. There is no simple general workaround for these, though you can often rephrase your problem into one where lookarounds can be avoided.
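
A few of the substitutions from the list above, as a minimal sketch. The `check` helper and the sample strings are made up for illustration; only `[[ ... =~ ... ]]` and the POSIX classes come from the list.

```bash
#!/bin/bash
# Hypothetical helper: prints "yes" if string $1 matches ERE $2, else "no".
# $2 is left unquoted on purpose so that it is treated as a regex.
check() {
    if [[ $1 =~ $2 ]]; then echo yes; else echo no; fi
}

check 42        '^[[:digit:]]+$'      # stands in for ^\d+$
check 4x2       '^[[:digit:]]+$'      # not all digits
check foo_bar1  '^[_[:alnum:]]+$'     # stands in for ^\w+$
check 'a   b'   '^a[[:space:]]+b$'    # stands in for ^a\s+b$

# Approximating the non-greedy a.*?b with a negated bracket expression.
lazy='a[^ab]*b'
if [[ 'xxaYYbZZb' =~ $lazy ]]; then
    lazy_match=${BASH_REMATCH[0]}
    echo "$lazy_match"                # stops at the first b: aYYb
fi
```

Keeping each pattern in a single-quoted string or variable sidesteps the shell's own quoting rules inside `[[ ]]`.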
tripleee
  • https://stackoverflow.com/questions/19453991/is-it-possible-to-perform-look-behind-and-look-ahead-assertions-in-grep-without has some ideas for how to reimplement lookarounds. – tripleee Apr 04 '19 at 04:59
  • Perhaps tangentially see also [Why are there so many different regular expression dialects?](https://stackoverflow.com/questions/2298007/why-are-there-so-many-different-regular-expression-dialects) – tripleee Oct 04 '21 at 05:13
  • Bash does support `\s` and others in certain cases, see my answer below. – alecov May 31 '22 at 18:52

from man bash

An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)).

ERE doesn't support look-ahead or look-behind, but your regex uses one: (?!Disk).

That's why your regex won't match as you expected.
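
One way to get the two Disk blocks without a look-ahead is to read the input line by line and start a new block whenever a line begins with `Disk`. This is a rough sketch under assumptions, not the asker's script: the `data` string is a shortened, made-up stand-in for disks.txt, and the `devices`/`blocks` arrays are names chosen for the example.

```bash
#!/bin/bash
# Hypothetical, shortened sample standing in for disks.txt.
data='Disk /dev/sda: 42.9GB
Partition Table: msdos
 1      1049kB  316MB   315MB   primary  ext4            boot

Disk /dev/sdb: 42.9GB
Partition Table: msdos
 5      316MB   38.7GB  38.4GB  primary  ext4'

# Header line of a block; group 1 captures the device name.
header='^Disk[[:space:]]+(/dev[^[:space:]:]+):'

devices=()   # device name of each block, in order of appearance
blocks=()    # full text of each block, same index as devices
n=-1         # index of the block currently being filled

while IFS= read -r line; do
    if [[ $line =~ $header ]]; then
        # A "Disk" header opens a new block.
        n=$((n + 1))
        devices[n]=${BASH_REMATCH[1]}
        blocks[n]=$line
    elif ((n >= 0)); then
        # Every other line belongs to the block that is currently open.
        blocks[n]+=$'\n'$line
    fi
done <<< "$data"

for i in "${!devices[@]}"; do
    printf '== %s ==\n%s\n' "${devices[i]}" "${blocks[i]}"
done
```

Lines that appear before the first `Disk` header are skipped, which matches what the original `^Disk ...` regex was trying to do.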

Kent
  • That, plus the lack of `\s` and `\S`. – Adrian Frühwirth Aug 29 '13 at 14:54
  • @AdrianFrühwirth `\s and \S` should be ok, see my answer, I added that section. – Kent Aug 29 '13 at 15:01
  • `\s` and `\S` may work in practice, but the bash documentation does not promise that they'll work -- only the ERE syntax parsed by `regex(3)` is guaranteed to be supported, and the POSIX ERE standard does not include these shortcuts. Relying on them is thus... unfortunate and fragile. – Charles Duffy Aug 29 '13 at 15:01
  • Charles is right... on my system I get: `[[ "aaa" =~ "\S+" ]] && echo "yes" || echo "no"` --> `no` – dcsohl Aug 29 '13 at 16:10
  • @CharlesDuffy sorry, I was testing it in zsh...... you are right, I am removing the \s \S part. – Kent Aug 29 '13 at 16:21

Bash supports what regcomp(3) supports on your system. Glibc's implementation does support \s and others, but due to the way Bash quotes stuff on binary operators, you cannot encode a proper \s directly, no matter what you do:

[[ 'a   b' =~ a[[:space:]]+b ]] && echo ok # OK
[[ 'a   b' =~ a\s+b ]] || echo fail        # Fail
[[ 'a   b' =~ a\\s+b ]] || echo fail       # Fail
[[ 'a   b' =~ a\\\s+b ]] || echo fail      # Fail

It is much simpler to work with a pattern variable for this:

pattern='a\s+b'
[[ 'a   b' =~ $pattern ]] && echo ok # OK
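
If a script has to run on systems both with and without this extension, one option is to probe for it once and fall back to the POSIX class. This is only a sketch of that idea; the `probe`, `ws`, and `device` names are invented for the example, and always using `[[:space:]]` is the simpler choice.

```bash
#!/bin/bash
# Probe once whether the system's regcomp(3) understands \s.
probe='^a\s+b$'
if [[ 'a   b' =~ $probe ]]; then
    ws='\s'             # shorthand is available (for example with glibc)
else
    ws='[[:space:]]'    # portable POSIX character class
fi

# Build the real pattern from whichever whitespace token was chosen.
pattern="^Disk${ws}+(/dev[^[:space:]]+):"
if [[ 'Disk /dev/sda: 42.9GB' =~ $pattern ]]; then
    device=${BASH_REMATCH[1]}
    echo "device: $device"
fi
```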
alecov
  • This is then obviously only true on systems where Bash was compiled with Glibc. For me, it works out of the box on Ubuntu, but not on MacOS. – tripleee Jun 01 '22 at 02:35

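Also, [\s\S] is just a way of saying "any character", i.e., roughly equivalent to . (in Perl-style engines the difference is that . does not match a newline by default). On my shell, [^\s] works but not [\S]. Note that a backslash inside a POSIX bracket expression is literal, so [^\s] really means "any character except a backslash or s" and only appears to work.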

perreal

I know you already "solved" this, but your original issue was probably as simple as not quoting $regex in your test, i.e.:

if [[ $data =~ "$regex" ]]; then

Bash variable expansion will simply plop in the string, and the space in your original regex will break test because:

regex="^Disk (/dev[\S]+):((?!Disk)[\s\S])*"
if [[ $data =~ $regex ]]; then

is the equivalent of:

if [[ $data =~ ^Disk (/dev[\S]+):((?!Disk)[\s\S])* ]]; then

and bash/test will have a fun time interpreting a bonus argument and all those unquoted meta-characters.

Remember, bash does not pass variables, it expands them.

Sammitch
  • This was pretty confusing after my 20-minute crash course ;) I ended up just writing a small Perl script that I invoke, and that was a lot simpler. I hadn't realized that the bash regex conventions were so different, as pretty much everything else I have used supports Perl-style. – Yablargo Aug 29 '13 at 19:51
  • This answer isn't actually correct -- `[[` has its own parser-level handling; it treats the content on the right-hand side as a literal string if quoted, and a regex if unquoted; it _does not_ perform word-splitting or globbing. This means `regex='.+'; [[ $data =~ $regex ]]` matches any non-empty string, whereas `regex='.+'; [[ $data =~ "$regex" ]]` matches only strings that contain the exact text `.+` within them. – Charles Duffy Aug 12 '20 at 16:57
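
The difference described in the last comment can be checked directly. This is a small sketch assuming bash 3.2 or later with default compatibility settings; the variable names are made up for the example.

```bash
#!/bin/bash
regex='.+'

# Unquoted: the expansion is treated as a regular expression.
if [[ 'hello' =~ $regex ]]; then unquoted=match; else unquoted=nomatch; fi

# Quoted: the expansion is matched as the literal text ".+".
if [[ 'hello' =~ "$regex" ]]; then quoted=match; else quoted=nomatch; fi

# The quoted form only succeeds when the string itself contains ".+".
if [[ 'a.+b' =~ "$regex" ]]; then literal=match; else literal=nomatch; fi

echo "unquoted=$unquoted quoted=$quoted literal=$literal"
# prints: unquoted=match quoted=nomatch literal=match
```

So a regex that contains spaces is safe to use unquoted as `$regex` inside `[[ ]]`; quoting it changes the meaning instead of protecting it.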