Extract capture group, if it exists, otherwise, just extract the original string

Question

Given a String, I'd like to use a regex to:

if the given String does NOT match regex, return the ENTIRE String
if the given String does match regex, then return ONLY the capture group

Let's say I have the following regex:

hello\s*([a-z]+)

Here are inputs and the return I am looking for:

"well hello" --> "well hello" (regex did not match)
"well hello world extra words" --> "world"
"well hello   world!!!" --> "world"
"well hello \n \n world\n\n\n" --> "world" (should ignore all newlines)
"this string doesn't match at all" --> "this string doesn't match at all"

Limitations: I am limited to using grep, sed, and awk. egrep, gawk are not available.

> print "world hello something else\n" | sed -rn "s/hello ([a-z]+)/\1/p"
world something else

This is the closest I've gotten. A few things:

it is returning other parts of the string
I couldn't get \s* to match, but a regular space works
not exactly sure, but the /p at the end of sed seems to print a newline

Don't combine the input and expected output in one text block as then we can't just copy/paste it to test with as-is. Please [edit] your question to show a block of sample input and then a separate block of the expected output given that input. — Ed Morton, Feb 04 '23 at 15:33
Regarding "egrep, gawk are not available" - `egrep` has been deprecated in favor of `grep -E` for at least a decade so `egrep` not being available isn't a problem. gawk is GNU awk so if that's not available then neither should GNU sed or GNU grep be. The `\s` shorthand for `[[:space:]]` is only available in the GNU versions of grep, sed, and awk, and the `-r` option for sed is only available in GNU sed. — Ed Morton, Feb 04 '23 at 15:36
`-r` is the flag to enable EREs in old versions of GNU sed only. If you use `-E` instead of `-r` then it'll work in current versions of GNU sed as well as BSD sed so it'll be much more portable. — Ed Morton, Feb 04 '23 at 15:43
If you want to match `"well hello \n \n world\n\n\n" --> "world"` across newlines using sed then you'd need GNU sed for `-z` to read all lines into memory at once or some convoluted hieroglyphics that add lines to the "hold" space with non-GNU seds. — Ed Morton, Feb 04 '23 at 15:43
Do you want to match that regexp multiple times across the whole of the input or only once? If your input contained `well hello hello world` what should the output be - `hello` or `world` or both? — Ed Morton, Feb 04 '23 at 15:45
What awk are you allowed to use (what is output of `awk --version`)? — Daweo, Feb 04 '23 at 18:42

score 1 · Answer 1 · answered Feb 04 '23 at 12:30

This might work for you (GNU sed):

sed -E 's/\\n/\n/g;/^well hello\s*([a-z]+).*/s//\1/;s/\n/\\n/g' file

Turn \n into real newlines.

Match lines that begin with well hello, followed by zero or more whitespace characters, followed by one or more characters a through z, followed by anything else. If the line matches, replace it with the captured characters a through z; otherwise leave the original string unchanged. Finally, turn any remaining real newlines back into \n.
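
As a rough check, the command above can be run against the sample inputs from the question. This is only a sketch: it assumes GNU sed (for \s and for \n in the replacement text), it pipes the input instead of reading a file, and run_answer1 is a hypothetical wrapper name that is not part of the answer.

```shell
# Assumes GNU sed (needed for \s and for \n in the replacement text).
# run_answer1 is a hypothetical wrapper name, not part of the answer.
run_answer1() {
    sed -E 's/\\n/\n/g;/^well hello\s*([a-z]+).*/s//\1/;s/\n/\\n/g'
}

# printf with %s keeps the backslash-n sequences literal, as in the question.
printf '%s\n' \
    'well hello' \
    'well hello world extra words' \
    'well hello   world!!!' \
    'well hello \n \n world\n\n\n' \
    "this string doesn't match at all" | run_answer1
```

With GNU sed this should print well hello, then world three times, then the last string unchanged.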

markp-fuso · Answer 2 · 2023-02-04T18:59:24.723

Addressing just the issue of why parts of the string that shouldn't be printing, are printing ...

Example:

printf "world hello something else\n" | sed -rn "s/hello ([a-z]+)/\1/p"

Actual output : world something else
Desired output:       something

From the sed man page:

 -n, --quiet, --silent
                suppress automatic printing of pattern space

In the example script, -n suppresses the automatic printing of the pattern space (the whole input line), and the p flag prints the pattern space whenever the substitution succeeds. The s command only replaces the portion of the line matched by hello ([a-z]+); notice there is nothing in this regex that addresses any leading/trailing characters in the input line, so said leading/trailing characters are left untouched (ie, they still show up in the output), hence the unwanted world and else.

To have the substitution replace the entire line, the regex needs to be expanded to cover the entire line; consider:

  hello ([a-z]+)             # does not cover leading/trailing characters
.*hello ([a-z]+)             # covers leading characters; does not cover trailing characters
  hello ([a-z]+).*           # does not cover leading characters; covers trailing characters
.*hello ([a-z]+).*           # covers leading/trailing characters

Updating the script to cover all leading/trailing characters (ie, the entire line of input):

printf "world hello something else\n" | sed -rn "s/.*hello ([a-z]+).*/\1/p"
                                                   ^^              ^^
Actual output: something
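
Building on the whole-line regex above, one possible way to also get the "return the entire string when nothing matches" behaviour is to drop -n and the p flag, so that sed prints every line whether or not the substitution fired. The sketch below is an assumption-laden variant, not the answer's own command: it swaps -r for -E and the literal space for [[:space:]]* so it should also run on non-GNU sed, and extract is a hypothetical helper name. It works line by line, so it does not handle matches that span newlines.

```shell
# extract is a hypothetical helper name.
# Without -n and /p, sed prints each line after the (possibly failed) substitution,
# so non-matching input comes back unchanged.
extract() {
    printf '%s\n' "$1" | sed -E 's/.*hello[[:space:]]*([a-z]+).*/\1/'
}

extract 'world hello something else'          # prints: something
extract "this string doesn't match at all"    # prints the input unchanged
```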

dawg · Answer 3 · 2023-02-04T21:09:45.637

Since you have strings vs a file, consider doing this entirely in Bash:

#!/bin/bash

strings=( 'well hello' 
    'well hello world extra words' 
    'well hello   world!!!' 
    'well hello \n\n  world\n\n' 
    "this string doesn't match at all" )

re='hello[[:space:]][[:space:]]*([a-z][a-z]*)'

for x in "${strings[@]}"; do 
    s=$(printf "$x")               # force interpretation of \n
    if [[ $s =~ $re ]]; then 
        printf \""$x"\""=> \"%s\"\n" "${BASH_REMATCH[1]}"
    else
        printf "No match: \"%s\"\n" "$s"
    fi  
done

Prints:

No match: "well hello"
"well hello world extra words"=> "world"
"well hello   world!!!"=> "world"
"well hello 

  world

"=> "world"
No match: "this string doesn't match at all"

(Note: It is possible to use a word boundary assertion in Bash / zsh depending on the platform, so that 'hello' as a regex only matches the full word 'hello' rather than matching 'phellogen' or 'Othello'. A platform-independent word-boundary version would be re='(^|[^[:alnum:]_])hello[[:space:]][[:space:]]*([a-z][a-z]*)', in which case the captured word is in "${BASH_REMATCH[2]}".)
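
The word-boundary regex from the note can be wrapped in a small function that returns either the captured word or the original string. This is a minimal sketch assuming Bash (it relies on [[ =~ ]] and BASH_REMATCH); extract_word is a hypothetical name, and the function prints the result rather than the labelled output used in the loop above.

```shell
#!/bin/bash
# extract_word is a hypothetical helper name.
extract_word() {
    local s=$1
    # Group 1 is the boundary (start of string or a non-word character),
    # group 2 is the word that follows "hello".
    local re='(^|[^[:alnum:]_])hello[[:space:]][[:space:]]*([a-z][a-z]*)'
    if [[ $s =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[2]}"
    else
        printf '%s\n' "$s"        # no match: return the input unchanged
    fi
}

extract_word 'well hello world extra words'   # prints: world
extract_word 'Othello world'                  # prints: Othello world
```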

You could also use perl:

for s in "${strings[@]}"; do 
    perl -0777 -nE '/\bhello\s+([a-z]+)/;say $1 ? "\"$_\" => \"$1\"" : "No match: \"$_\""' <<<$(printf "$s")
done

Prints:

No match: "well hello"
"well hello world extra words" => "world"
"well hello   world!!!" => "world"
"well hello 

  world

" => "world"
No match: "this string doesn't match at all"

Or you could use GNU grep:

for s in "${strings[@]}"; do 
    r=$(ggrep -zoP '\bhello\s+\K([a-z]+)' <<<$(printf "$s") | tr -d '\0' )
    [[ -z "$r" ]] && printf "No match: \"$s\"\n" || printf "\"$s\" => \"$r\"\n"
done

Or any awk:

for s in "${strings[@]}"; do 
    awk '{s = s $0 ORS}
    END{
    sub(ORS "$", "", s)
    n = split(s,fields,"[^[:alpha:]]+")
    for(i=1;i<n;i++){
        if(fields[i]=="hello" && fields[i+1]~/[a-z]+/) {
            printf "\"%s\" => %s\n", s, fields[i+1]
            found=1
            break
        }
    }
    if (!found) printf "Not Found: \"%s\"\n", s
    }' <<<$(printf "$s")
done
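
If a single pipeline stage is preferred over a loop, the awk idea can be reduced to one match() call that prints only the captured word, or the whole input when nothing matches. This is a sketch under assumptions: extract_awk is a hypothetical name, the regex is the one from the question, and [ \t\n] stands in for [[:space:]] in case the awk at hand lacks POSIX character classes.

```shell
# extract_awk is a hypothetical helper name; it reads standard input.
extract_awk() {
    awk '
        { buf = buf (NR > 1 ? "\n" : "") $0 }      # slurp all input lines into buf
        END {
            if (match(buf, /hello[ \t\n]*[a-z]+/)) {
                m = substr(buf, RSTART, RLENGTH)   # "hello", whitespace, then the word
                sub(/^hello[ \t\n]*/, "", m)       # keep only the word
                print m
            } else {
                print buf                          # no match: print the input as-is
            }
        }'
}

printf 'well hello \n \n world\n\n\n' | extract_awk    # prints: world
```

Because it reads standard input, the function can sit at the end of a pipeline.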

My small issue is that it has to be done against a curl output. Specifically, it's happening via remote curl on hosts. We have a CLI that gets us a list of hosts, authenticates on our behalf, and will run commands on every host (thousands). So it is hard to do without a one-liner. I'll still give these a try today! — Eric Lingamfelter, Feb 06 '23 at 18:24

score 1 · Answer 4 · answered Feb 04 '23 at 23:22

Using GNU sed

$ sed -Ez 's/[a-z ]+hello[ \t]+(\\n ?|\n ?)+?([a-z]+)[^"]*/\2/g' input_file
"well hello"
"world"
"world"
"world"
"this string doesnt match at all"

answered Feb 04 '23 at 23:22 by HatLess

score 0 · Bohemian · Answer 5 · 2023-02-05T06:25:35.403

Use an alternation:

hello\s+([a-z]+)|([\s\S]*)

Then extract groups 1 and 2:

sed -rn "s/hello\s+([a-z]+)|([\s\S]*)/\1\2/p"

The alternation matches left to right, so if the first part doesn't match, the whole input is matched; one of group 1 or group 2 will be blank.

edited Feb 05 '23 at 06:25

answered Feb 04 '23 at 03:20


Thank you. I think that solves the strings that don’t match. I’m still struggling with why parts of the string that shouldn’t be printing, are printing. – Eric Lingamfelter Feb 04 '23 at 03:27
That wouldn't produce the desired output from `"well hello \n \n world\n\n\n" --> "world" (should ignore all newlines)`. – Ed Morton Feb 04 '23 at 21:51
