Bash compatible regex (with groups)

Question

I'm trying to write a simple script that uses a regex. The regex works in text editors and online regex checkers, but I can't find how to make it work in Bash.

I need to capture groups, by the way.

Example text:

2020-03-06 10:00:07 Test2: <?xml version="1.0" encoding="UTF-8"?><soapenv:Envelope xmlns:soape...
2020-03-06 10:00:13 Test2: <?xml version="1.0" encoding="UTF-8"?><soapenv:Envelope xmlns:soape...

This is my script. It reads each line and creates a file named DATE_HOUR.xml filled with the text until the end of the line (after formatting it):

#!/bin/bash
: ${1?"USO: $0 NOMBRE-DEL-ARCHIVO"} #If no args passed

regex="^(\d*-\d*-\d*)\s(\d*:\d*:\d*)\s(\w*): (.*)$" #This one is working on editors

mkdir -p out
while read line
do
   if [[ $line =~ $regex ]] #IT NEVER ENTERS HERE
    then
        date="${BASH_REMATCH[1]}"   #DATE
        time="${BASH_REMATCH[2]}"   #TIME
        time="${time/:/-}"          #REPLACE : with -
        name="${BASH_REMATCH[3]}"   #I DO NOT USE IT BY NOW
        text="${BASH_REMATCH[4]}"   #TEXT
        echo $text | xmllint --format - > out/$date"_"$time.xml
    fi
done < $1

I've tried this regex, but it sure has errors:

regex="^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}) ([[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) ([[a-zA-Z0-9]]{1,}): (*{1,})$"

Thank you.

The regular expressions supported by Bash are [POSIX Extended Regular Expressions](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04) (ERE). See [mkelement0's excellent answer](https://stackoverflow.com/a/35924143/4154375) to [How do I use a regex in a shell script?](https://stackoverflow.com/q/35919103/4154375). — pjh, Mar 10 '20 at 17:59

Tom Fenech · Accepted Answer · 2020-03-11T08:18:55.907

Firstly, you cannot use "Perl-style" shorthand such as \d and \s in Bash. Your final attempt is close but contains a few errors, such as [[a-zA-Z0-9]] (should only have one pair of []) and *{1,} (not 100% clear on what this does but it's not what you want!).

This pattern can be used instead:

regex='([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) ([a-zA-Z0-9]+): (.*)'

I'm using [0-9] to match the digits - you could use [[:digit:]] instead but it doesn't look like you need support for any characters outside the range 0-9. I also replaced \s with a simple space (you could use [[:blank:]] to match spaces or tabs if that's a possibility).
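As a rough illustration of the point above, the sketch below runs a Perl-style pattern and a POSIX-class pattern against the same line. The sample line and the variable names are made up for the example. The Perl-style result is not shown as a fixed outcome because it depends on the platform's regex library; `\d` is not a digit class in POSIX ERE.

```bash
#!/bin/bash
# Sketch: Perl-style shorthands next to the POSIX ERE forms that Bash understands.
line='2020-03-06 10:00:07 Test2: payload'   # made-up sample line

perl_style='^(\d+-\d+-\d+)\s'               # \d and \s are not defined by POSIX ERE
posix_style='^([[:digit:]]+-[[:digit:]]+-[[:digit:]]+)[[:blank:]]'

if [[ $line =~ $perl_style ]]; then perl_result='match'; else perl_result='no match'; fi
if [[ $line =~ $posix_style ]]; then posix_result='match'; else posix_result='no match'; fi

echo "perl-style:  $perl_result"
echo "posix-style: $posix_result, date=${BASH_REMATCH[1]}"
```

Keeping the pattern in a single-quoted variable and leaving `$posix_style` unquoted on the right of `=~` makes Bash treat it as a regex and not as a literal string.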

Regarding the anchors ^ and $, you probably don't need them:

^ is only necessary if you want to avoid lines that match the pattern but don't start with it (it looks like all your lines start with it, in which case this wouldn't be needed)
$ is irrelevant as your pattern ends with .* which will consume the whole rest of the line
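The two points about anchors can be checked with a small sketch. The `prefixed` line with leading text is invented for the example, as are the variable names.

```bash
#!/bin/bash
# Sketch: effect of ^ and $ on the pattern from this answer.
unanchored='([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) ([a-zA-Z0-9]+): (.*)'
anchored="^${unanchored}"
with_dollar=${unanchored}'$'

prefixed='junk 2020-03-06 10:00:07 Test2: <xml/>'   # made-up line with leading text
line='2020-03-06 10:00:07 Test2: <xml/>'

# Without ^ the pattern is found in the middle of the line; with ^ it is rejected.
if [[ $prefixed =~ $unanchored ]]; then unanchored_result='match'; else unanchored_result='no match'; fi
if [[ $prefixed =~ $anchored ]]; then anchored_result='match'; else anchored_result='no match'; fi

# The trailing (.*) already runs to the end of the line, so $ changes nothing.
[[ $line =~ $unanchored ]] && tail_plain=${BASH_REMATCH[4]}
[[ $line =~ $with_dollar ]] && tail_dollar=${BASH_REMATCH[4]}

echo "unanchored: $unanchored_result"   # match
echo "anchored:   $anchored_result"     # no match
echo "same tail:  $([[ $tail_plain == "$tail_dollar" ]] && echo yes || echo no)"   # yes
```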

Testing it out:

$ line='2020-03-06 10:00:07 Test2: <?xml version="1.0" encoding="UTF-8"?><soapenv:Envelope xmlns:soape...'
$ regex='([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) ([a-zA-Z0-9]+): (.*)'
$ [[ $line =~ $regex ]] && echo yes
yes
$ printf '%s\n' "${BASH_REMATCH[@]}"
2020-03-06 10:00:07 Test2: <?xml version="1.0" encoding="UTF-8"?><soapenv:Envelope xmlns:soape...
2020-03-06
10:00:07
Test2
<?xml version="1.0" encoding="UTF-8"?><soapenv:Envelope xmlns:soape...
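Putting the pattern back into the loop from the question might look like the sketch below. It is one possible arrangement, not the only one: `split_log` is a hypothetical function name, the raw-text fallback when `xmllint` is not installed is an assumption added for the example, and `^` is kept so that lines with leading text are skipped. It also uses `//` in the substitution so that every colon in the time is replaced; a single `/` replaces only the first one.

```bash
#!/bin/bash
# Sketch: the question's loop combined with the ERE pattern from this answer.
# split_log is a hypothetical helper: split_log LOGFILE [OUTDIR]
split_log() {
    local input=$1 outdir=${2:-out}
    local regex='^([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) ([a-zA-Z0-9]+): (.*)'
    local line date time text target

    mkdir -p "$outdir"
    # IFS= and -r keep leading whitespace and backslashes in each line intact.
    while IFS= read -r line; do
        [[ $line =~ $regex ]] || continue      # skip lines that do not match
        date=${BASH_REMATCH[1]}
        time=${BASH_REMATCH[2]//:/-}           # // replaces every colon
        text=${BASH_REMATCH[4]}
        target="$outdir/${date}_${time}.xml"
        if command -v xmllint >/dev/null 2>&1; then
            printf '%s\n' "$text" | xmllint --format - > "$target"
        else
            printf '%s\n' "$text" > "$target"  # assumption: fall back to the raw text
        fi
    done < "$input"
}
```

In a script it could be called as `split_log "$1" out`, which mirrors the `done < $1` and `mkdir -p out` of the original.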

Thank you. I knew about Perl-style. That's why I tried the second regex with no luck (I forgot about the dot before the asterisk, lol). Also, I just noticed that using `^` and `$` on a string that will always be a single line is not necessary. — Hache_raw, Mar 10 '20 at 19:34
@Hache_raw I edited my answer to add a bit more explanation about the anchors to the start and end of the line. — Tom Fenech, Mar 11 '20 at 08:19

Romeo Ninov · Answer 2 · 2020-03-10T21:20:31.573

Instead of fighting with the regex, why not try awk:

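while IFS= read -r line
do
    # Build the file name from the date and time fields
    filename=$(awk '{print $1"_"$2}' <<< "$line")
    # Blank the first three fields, strip the leading spaces, and format the rest
    awk '{$1="";$2="";$3=""; gsub(/^[[:space:]]+/,"",$0); print}' <<< "$line" | xmllint --format - > "out/${filename}.xml"
done < "$1"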

If you do not want colons in the file name, you can replace that line with:

filename=$(awk '{gsub(/:/,"",$2); print $1"_"$2}' <<< "$line")

What this code does is simple. First it loops through all the lines (as in your code). Then it assigns to filename the first and second fields, concatenated with an underscore.

Next, the second awk assigns an empty string to the first three fields, then gsub removes the spaces left at the start of the line (the separators that stood between the first, second, third and fourth fields). If I do not do this, some versions of xmllint will complain. Then it prints the line. The construction <<< "$line" is a here-string: it feeds the content of $line to the command as its standard input.
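To see the intermediate values described above, the two awk calls can be run on a single line outside the loop. The sample line below is made up, with a short well-formed XML payload standing in for the real one.

```bash
#!/bin/bash
# Sketch: what the two awk calls produce for one made-up sample line.
line='2020-03-06 10:00:07 Test2: <a><b>1</b></a>'

# First two whitespace-separated fields joined with an underscore, colons removed.
filename=$(awk '{gsub(/:/,"",$2); print $1"_"$2}' <<< "$line")

# Blank the first three fields, then strip the leading spaces that remain.
payload=$(awk '{$1="";$2="";$3=""; gsub(/^[[:space:]]+/,"",$0); print}' <<< "$line")

echo "$filename"   # 2020-03-06_100007
echo "$payload"    # <a><b>1</b></a>
```

One thing to keep in mind: assigning to a field makes awk rebuild the line with single spaces between fields, so any run of several spaces or tabs inside the payload is collapsed to one space. That is usually harmless for XML that is about to be reformatted anyway.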

edited Mar 10 '20 at 21:20

answered Mar 10 '20 at 10:15

`cat $line` is wrong here. Use either `echo "$line"` or `awk ... <<< "$line"` instead. – Wiimm Mar 10 '20 at 11:19
Wow, nice approach. I upvoted your answer but I'll mark the other one as correct since it uses my code and explains things. Looking at your code I can't even guess what's happening. Two things: 1) I didn't know that I could use colons in filenames. 2) I don't know why but `print $1_$2` is not printing the underscore. If you could explain the code a little, it would be great! Thanks. – Hache_raw Mar 10 '20 at 19:27
@Hache_raw, code corrected, now it should print the underscore. If you do not like colons, they can be removed. Will add how to do it. – Romeo Ninov Mar 10 '20 at 21:16
