I have this script script.sh:

#!/bin/bash
file_path=$1
result=$(grep -Po 'value="\K.*?(?=")' $file_path)
echo $result

and this file text.txt:

value="a"
value="b"
value="c"

When I run ./script.sh /file/directory/text.txt, the output in the terminal is the following:

a b c

I understand what the script does, but I don't understand HOW it works, so I need a detailed explanation of this part of the command:

-Po 'value="\K.*?(?=")'

If I understood correctly, \K is a Perl command. Can you give me an alternative in shell (for example, with the awk command)?


Thank you in advance.

Charles Duffy
Ordinary User

1 Answer

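  • grep -P enables PCRE (Perl-compatible regular expression) syntax. This is a nonstandard extension: not even all builds of GNU grep support it, because it depends on the optional libpcre library, and whether to link that in is a compile-time option.
  • grep -o emits only the matched text, not the entire line containing it. (This too is nonstandard, though more widely available than -P.)
  • \K is a PCRE extension to regex syntax that discards everything matched before that point, so the leading value=" must be present but is left out of the match output.
  • .*? is a non-greedy match: it consumes as few characters as possible, so it stops at the first closing quote rather than the last one on the line.
  • (?=") is a lookahead assertion: it requires a " to come next without making it part of the match.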
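
To see what each piece of the pattern contributes, here is a small sketch that adds the pieces one at a time. It assumes a grep built with -P support and skips the demo otherwise; the `sample` variable and the `show` helper are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sample input mirroring text.txt from the question.
sample=$(printf 'value="a"\nvalue="b"\nvalue="c"\n')

# show PATTERN (illustrative helper): run grep -Po on the sample and
# join the output lines with spaces.
show() {
  printf '%s\n' "$sample" | grep -Po "$1" | paste -sd' ' -
}

if printf 'x\n' | grep -qP 'x' 2>/dev/null; then
  show 'value=".*?"'         # no \K, no lookahead -> value="a" value="b" value="c"
  show 'value="\K.*?"'       # \K drops the prefix -> a" b" c"
  show 'value="\K.*?(?=")'   # lookahead keeps the closing quote out -> a b c
else
  echo 'this grep was built without -P support' >&2
fi
```

Each step removes one part of the surrounding text from the output, which is why the final pattern prints only the values.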

Since your shell is bash, you have ERE (POSIX extended regular expression) matching built in, via the =~ operator inside [[ ]]. As an alternative that uses only built-in functionality (no external tools, whether grep, awk or otherwise):

#!/usr/bin/env bash
regex='value="([^"]*)"'                 # store regex (w/ match group) in a variable
results=( )                             # define an empty array to store results
while IFS= read -r line; do             # iterate over lines on input
  if [[ $line =~ $regex ]]; then        # ...and, when one matches the regex...
    results+=( "${BASH_REMATCH[1]}" )   # ...put the group's contents in the array
  fi
done <"$1"                              # with stdin coming from the file named in $1
printf '%s\n' "${results[*]}"           # combine array results with spaces and print

See http://wiki.bash-hackers.org/syntax/ccmd/conditional_expression for a discussion of =~, and http://wiki.bash-hackers.org/syntax/shellvars#bash_rematch for a discussion of BASH_REMATCH. See BashFAQ #1 for a discussion of reading files line-by-line with a while read loop.
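
Since the question also asked about awk, here is a minimal sketch using only POSIX awk features (`match`, `RSTART`, `RLENGTH`, `substr`). The `extract_values` function name is hypothetical, and, like the bash loop above, it reports only the first match on each line, whereas grep -o would report every match.

```shell
#!/usr/bin/env bash
# extract_values FILE (hypothetical helper): print the text between value="
# and the next double quote, one result per line, first match per line only.
extract_values() {
  awk '
    match($0, /value="[^"]*"/) {
      # RSTART and RLENGTH cover the whole match: the 7-character prefix,
      # the content, and the closing quote. Trim 7 from the front and 1 from the end.
      print substr($0, RSTART + 7, RLENGTH - 8)
    }
  ' "$1"
}

# When a file name is given, join the lines with spaces, like echo $result did.
if [ -n "${1-}" ]; then
  extract_values "$1" | paste -sd' ' -
fi
```

Saved as a script and run with the path to text.txt, this should print the same single line as the original script.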

Charles Duffy
  • Wouldn't it rather be `value="([^"]*)"'` to emulate non-greediness? – Benjamin W. Jun 13 '17 at 15:18
  • @CharlesDuffy You have used the result variable as an array. Could you please edit your code so the result variable is of the same type as this: `result=$(grep -Po 'value="\K.*?(?=")' $file_path)` (a string, I guess)? Sorry for my English – Ordinary User Jun 13 '17 at 15:47
  • @OrdinaryUser, I used an array very intentionally -- if you're going to use your result like `echo $result`, then you're string-splitting it (into words on whitespace characters) and evaluating each of those words generated by the splitting operation as a glob. That's an innately error-prone operation, and it means you can't tell the difference between having one value of `"hello world"` and two separate values, where the first is `hello` and the second is `world`. With an array, the boundary divisions are known and fixed, and you can always go from that array *to* a string later. – Charles Duffy Jun 13 '17 at 15:57
  • @OrdinaryUser, ...so, for instance: `if (( ${#results[@]} )); then printf -v result '%s\n' "${results[@]}"; else result=''; fi` will generate a single string named `result` with boundaries separated by newlines. `echo $result` won't show those newlines, but `echo "$result"` will -- that's probably the best parallel for the original code. – Charles Duffy Jun 13 '17 at 15:58
  • Very, very nice advice for a bash-only alternative to GNU core tools. – Cymatical Sep 24 '21 at 14:48
  • @Cymatical, this can be needed even on systems that _do_ use GNU tools; the dependency of GNU grep on libpcre (which is used to provide `-P`) is a compile-time option and can be turned off. – Charles Duffy Nov 08 '21 at 15:13
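
The point made in the comments about arrays versus unquoted strings can be sketched as follows. The sample values and the `count_words` helper are hypothetical, chosen only to make the word splitting visible.

```shell
#!/usr/bin/env bash
results=( "hello world" "c" )    # two values, the first containing a space

# Join into one newline-separated string, as suggested in the comments.
if (( ${#results[@]} )); then
  printf -v result '%s\n' "${results[@]}"
else
  result=''
fi

# count_words (illustrative helper): print how many arguments it received.
count_words() { echo "$#"; }

count_words $result              # unquoted string is split on whitespace -> 3
count_words "${results[@]}"      # array keeps the original boundaries -> 2
```

The unquoted expansion can no longer tell that `hello world` was one value, while the array still can.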