4

I've been looking for more advanced information on using regex with bash and have not found much.

Here's the concept, with a simple string:

myString="DO-BATCH BATCH-DO"

if [[ $myString =~ ([[:alpha:]]*)-([[:alpha:]]*) ]]; then
 echo "${BASH_REMATCH[1]}" #first parens
 echo "${BASH_REMATCH[2]}" #second parens
 echo "${BASH_REMATCH[0]}" #full match
fi

outputs:
DO
BATCH
DO-BATCH

So fine, it does the first match (DO-BATCH), but how do I pull the second match (BATCH-DO)? I'm just drawing a blank here and cannot find much info on bash regex.

pn1 dude
  • It's not clear what you are asking, as "DO-BATCH" does not occur in your string. However, do you mean you'd like to also have `${BASH_REMATCH[3]}` equal to "BATCH", etc? – chepner Jul 19 '12 at 17:26
  • Oops, you're correct... Edited OP. And yes, that is correct -> ${BASH_REMATCH[3]} == "BATCH" and ${BASH_REMATCH[4]} == "DO" – pn1 dude Jul 19 '12 at 18:00

5 Answers

5

OK so one way I did this is to put it in a for loop:

myString="DO-BATCH BATCH-DO"
for aString in ${myString[@]}; do
    if [[ ${aString} =~ ([[:alpha:]]*)-([[:alpha:]]*) ]]; then
     echo "${BASH_REMATCH[1]}" #first parens
     echo "${BASH_REMATCH[2]}" #second parens
     echo "${BASH_REMATCH[0]}" #full match
    fi
done

which outputs:
DO
BATCH
DO-BATCH
BATCH
DO
BATCH-DO

Which works but I kind of was hoping to pull it all from one regex if possible.

pn1 dude
  • `perl` supports a notion of repeated matching via the `g` flag of its matching operator `m//`, but to the best of my knowledge `bash` does not have an equivalent. – chepner Jul 19 '12 at 18:38
1

In your answer, myString is not an array, but you use an array reference to access it. This works in Bash because the 0th element of an array can be referred to by just the variable name and vice versa. What that means is that you could use:

for aString in $myString; do

to get the same result in this case.

In your question, you say the output includes "BATCH-DO". I get "DO-BATCH" so I presume this was a typo.

The only way to get the extra strings without using a for loop is to use a longer regex. By the way, I recommend putting Bash regexes in a variable. It makes certain types much easier to use (those that contain whitespace or special characters, for example).

pattern='(([[:alpha:]]*)-([[:alpha:]]*)) +(([[:alpha:]]*)-([[:alpha:]]*))'
[[ $myString =~ $pattern ]]
declare -p BASH_REMATCH    #dump the array

Outputs:

declare -ar BASH_REMATCH='([0]="DO-BATCH BATCH-DO" [1]="DO-BATCH" [2]="DO" [3]="BATCH" [4]="BATCH-DO" [5]="BATCH" [6]="DO")'

The extra set of parentheses is needed if you want to capture the individual substrings as well as the hyphenated phrases. If you don't need the individual words, you can eliminate the inner sets of parentheses.

Notice that you don't need to use if when you only need to extract substrings. You only need if to take conditional action based on a match.

Also notice that ${BASH_REMATCH[0]} will be quite different with the longer regex since it contains the whole match.
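
As a rough illustration of the two points above (keeping the regex in a variable, and how the group indices line up), here is a minimal sketch. The variable names firstPair, secondPair and quotedMatched are made up for this example. It also shows one reason the variable helps: the expansion must stay unquoted on the right-hand side of =~, because in Bash 3.2 and later a quoted right-hand side is matched as a literal string.

```bash
#!/usr/bin/env bash
myString="DO-BATCH BATCH-DO"
pattern='(([[:alpha:]]*)-([[:alpha:]]*)) +(([[:alpha:]]*)-([[:alpha:]]*))'

# Unquoted expansion: $pattern is interpreted as a regex.
if [[ $myString =~ $pattern ]]; then
  firstPair=${BASH_REMATCH[1]}    # outer group of the first phrase
  secondPair=${BASH_REMATCH[4]}   # outer group of the second phrase
fi

# Quoted expansion: the pattern text is compared literally, so it does not
# match here (illustrative names, not part of the original answer).
if [[ $myString =~ "$pattern" ]]; then
  quotedMatched=yes
else
  quotedMatched=no
fi

echo "first pair:     $firstPair"
echo "second pair:    $secondPair"
echo "quoted matched: $quotedMatched"
```

Groups are numbered by the position of their opening parenthesis, which is why the second hyphenated phrase lands at index 4 rather than 2.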

Dennis Williamson
  • Yes, I had edited a typo and forgot to update the output. Thanks. Yes, myString is not an array. I had initially made it one but found that the for loop didn't need it. I messed around a bit and ended up working with read -a to set the array to a variable. I'm not sure what declare -p BASH_REMATCH would give me except a listing of what's in the array. – pn1 dude Jul 19 '12 at 21:04
  • @pn1dude: Yes, `declare -p BASH_REMATCH` is just a convenient way to dump the contents of the array while testing, for example. – Dennis Williamson Jul 19 '12 at 22:29
1

Per @Dennis Williamson's post I messed around and ended up with the following:

myString="DO-BATCH BATCH-DO" 
pattern='(([[:alpha:]]*)-([[:alpha:]]*)) +(([[:alpha:]]*)-([[:alpha:]]*))'

[[ $myString =~ $pattern ]] && { read -r -a myREMatch <<< "${BASH_REMATCH[@]}"; }

echo "\${myString} -> ${myString}" 
echo "\${#myREMatch[@]} -> ${#myREMatch[@]}"

for (( i = 0; i < ${#myREMatch[@]}; i++ )); do   
  echo "\${myREMatch[$i]} -> ${myREMatch[$i]}" 
done

This works fine, except that myString must contain exactly the 2 values. I'm posting it because it's kind of interesting and I had fun messing with it. But to make this more generic and handle any number of paired groups (i.e. DO-BATCH), I'm going to go with a modified version of my original answer:

myString="DO-BATCH BATCH-DO" 
myRE="([[:alpha:]]*)-([[:alpha:]]*)"

read -r -a myString <<< "$myString"

for aString in "${myString[@]}"; do
  echo "\${aString} -> ${aString}"  
  if [[ ${aString} =~ ${myRE} ]]; then
    echo "\${BASH_REMATCH[@]} -> ${BASH_REMATCH[@]}"
    echo "\${#BASH_REMATCH[@]} -> ${#BASH_REMATCH[@]}"
    for (( i = 0; i < ${#BASH_REMATCH[@]}; i++ )); do
      echo "\${BASH_REMATCH[$i]} -> ${BASH_REMATCH[$i]}"
    done
  fi
done

I would have liked a Perl-style multiple match, but this works fine.

pn1 dude
0

Although this is a year-old question (without an accepted answer), could the regex pattern be simplified to:

myRE="([[:alpha:]]*-[[:alpha:]]*)"

by removing the inner parentheses to find a smaller (more concise) set of the words DO-BATCH and BATCH-DO?

It works for me with your 18:10 answer: ${BASH_REMATCH[0]} and ${BASH_REMATCH[1]} result in the 2 words being found.
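
To make that concrete, here is a small sketch of the word-by-word loop using the simplified pattern. The names words and found are just placeholders for this example. With a single capture group, ${BASH_REMATCH[1]} holds the same text as ${BASH_REMATCH[0]} on each iteration, as long as the pattern has nothing outside the parentheses.

```bash
#!/usr/bin/env bash
myString="DO-BATCH BATCH-DO"
myRE="([[:alpha:]]*-[[:alpha:]]*)"

# Split the string on whitespace into an array (hypothetical name: words).
read -r -a words <<< "$myString"

found=()   # collects one hyphenated word per matching element
for word in "${words[@]}"; do
  if [[ $word =~ $myRE ]]; then
    found+=("${BASH_REMATCH[1]}")
  fi
done

printf '%s\n' "${found[@]}"
```

The trade-off is that the individual halves (DO, BATCH) are no longer available as separate groups.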

David
0

In case you don't actually know how many matches there will be ahead of time, you can use this:

#!/bin/bash

function handle_value {
  local one=$1
  local two=$2

  echo "i found ${one}-${two}"
}

function match_all {
  local current=$1
  local regex=$2
  local handler=$3

  while [[ ${current} =~ ${regex} ]]; do
    "${handler}" "${BASH_REMATCH[@]:1}"

    # trim everything up to and including the portion already matched
    # (inner quotes make the matched text literal rather than a glob)
    current=${current#*"${BASH_REMATCH[0]}"}
  done
}

match_all \
  "DO-BATCH BATCH-DO" \
  '([[:alpha:]]*)-([[:alpha:]]*)[[:space:]]*' \
  'handle_value'
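
A variation on the same loop, sketched under a few assumptions: collect_matches and the global matches array are hypothetical names, the pattern uses + instead of * so that each match consumes at least one letter on both sides of the hyphen, and the loop stops on an empty match so it cannot spin forever. It collects the full matches into an array instead of calling a handler.

```bash
#!/usr/bin/env bash

# collect_matches STRING REGEX
# Stores every successive match of REGEX in the global array "matches".
collect_matches() {
  local current=$1
  local regex=$2
  local whole
  matches=()

  while [[ -n $current && $current =~ $regex ]]; do
    whole=${BASH_REMATCH[0]}

    # An empty match would never shorten the string, so stop here.
    [[ -z $whole ]] && break

    matches+=("$whole")

    # Drop everything up to and including the matched text; the inner
    # quotes keep the match from being treated as a glob pattern.
    current=${current#*"$whole"}
  done
}

collect_matches "see DO-BATCH, then BATCH-DO." '[[:alpha:]]+-[[:alpha:]]+'
printf '%s\n' "${matches[@]}"
```

Because the trim uses #* rather than #, the match does not have to sit at the start of the remaining string, so surrounding text and punctuation are skipped.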

Lucas
  • Why doesn't it work if `([[:alpha:]]+)-([[:alpha:]]+)` is used as a regex? – Wiktor Stribiżew Nov 11 '21 at 15:51
  • What doesn't work? The regex you supply would match the first item `DO-BATCH` with `${BASH_REMATCH[1]}` == `DO` and `${BASH_REMATCH[2]}` as `BATCH`, but you would have to run the regex again against the remaining string to get the second value `BATCH-DO`. Which is why i use a `while` loop here. If asking why not use `+` instead of `*` but keep the rest of this code, then you have a good point, `+` probably should be used, but the original example was `*` so have to consider that as possibly intentional support for things like `-SECOND` or `FIRST-` – Lucas Nov 12 '21 at 20:12
  • Ah, got it, we need to add `*`, `current="${current#*${BASH_REMATCH[0]}}"`. See [this online code demo](https://ideone.com/2xcmGS) to see what I mean. – Wiktor Stribiżew Nov 12 '21 at 21:20