
I have not found a way to use the regex `.+?(?=,)` in a sed command to extract part of this string (a lazy match followed by a lookahead for the first comma).

In plain English, I want to extract the part of the string that lies before the first comma. Since I eventually plan to extract just the filename from the string, I cannot rely on the cut command and will have to use sed:

name='ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif, band 1: Failed to compute statistics, no valid pixels found in sampling.'

These are the variations I have tried. A test substitution, sed 's/band/rose/', worked. The other variations shown below produced only whitespace as output.

while read -r line; do
    name="$line"
    echo $name
    #file_path=$(echo $name | cut -d "," -f 1)
    #file_path=$(echo $name | sed -e '/s\/.+?(?=,)///')
    #file_path=$(echo $name | sed 's/band/rose/')
    file_path=$(echo $name | sed '/s\/.+?(?=, )///')
    #file_path=$(echo $name | grep -P '.+?(?=,)')
    #file_path=$(echo $name | sed 
    #file_path=$(echo $name | awk '/.+?(?=,)/{print $name}'
    echo $file_path
done < "$filename"

Expected Result - ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif

Actual Results - 'lots of spaces'

I've also noticed that the regex I've used produces different matches on the Regex101 website depending on whether I'm using Firefox on Windows or on Ubuntu 16.04 LTS:

Windows - https://regex101.com/r/WWGf8F/1 Ubuntu - https://regex101.com/r/NpL2Oa/1

I'm not sure if this is causing the expression not to be recognized by sed -e?

I have used these references for the different expressions in the code above:

https://likegeeks.com/regex-tutorial-linux/

How to match "anything up until this sequence of characters" in a regular expression?

https://www.regular-expressions.info/lookaround.html?wlr=1

https://linux.die.net/man/1/sed

Rose
  • 1
    `sed 's/,.*//'` will bring the desired output, although I'm not sure if it is what you want to do. Would you specify the process you want to perform in *English*, because your command `sed '/s\/.+?(?=, )///'` does not work and is not clear what you want to do. Note that I'm *not* the downvoter. – tshiono Jan 07 '19 at 01:50
  • Agreed, `sed 's/,.*$//'` or `sed 's/^\([^,][^,]*\).*$/\1/'`. Either will do what you want. (or `grep -o '^[^,]*'` or `awk -F, '{print $1}'` for that matter) – David C. Rankin Jan 07 '19 at 02:17
  • 2
    `sed` is only guaranteed to support BRE ("POSIX Basic Regular Expressions"), and many versions also offer an extension to access ERE syntax. Lookahead and lookbehind are PCRE extensions, not part of either standard. See http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html – Charles Duffy Jan 07 '19 at 02:32
  • 2
    BTW, `echo $name` is inherently buggy -- see [BashPitfalls #14](http://mywiki.wooledge.org/BashPitfalls#echo_.24foo). Use `<<<"$name"`, `printf '%s\n' "$name"`, or `echo "$name"` *with the quotes*, in that order of preference. – Charles Duffy Jan 07 '19 at 02:34
  • 2
    ...and you don't need `sed` to do something simple like trim everything after the comma in a string at all. if `string=foo,bar`, then `${string%%,*}` will evaluate to just `foo`. – Charles Duffy Jan 07 '19 at 02:47

1 Answer


In plain English I want to extract the part of the string that lies before the first comma. As I'm planning to extract in the future the specific filename of the string, I cannot rely on the cut command (I will have to eventually use sed command)

Input String

ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif, band 1: Failed to compute statistics, no valid pixels found in sampling.

Expected Results

ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif

Before we get to possible reasons why your sed command isn't working, let's look at your actual problem above. If you simply want to extract the text before the first comma, then all you need is:

sed 's/,.*//'

(which simply says: delete everything from the first comma to the end of the line)
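As a minimal sketch of how that substitution could slot into the loop from the question: the temporary file and its single sample line are assumptions standing in for the real "$filename", and printf with a quoted variable replaces the unquoted echo, as suggested in the comments.

```shell
# Hypothetical stand-in for the real input file named by "$filename".
filename=$(mktemp)
printf '%s\n' 'ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif, band 1: Failed to compute statistics, no valid pixels found in sampling.' > "$filename"

while IFS= read -r line; do
    # Delete everything from the first comma to the end of the line.
    file_path=$(printf '%s\n' "$line" | sed 's/,.*//')
    printf '%s\n' "$file_path"
done < "$filename"

rm -f "$filename"
```

Because the loop reads from a redirection rather than a pipe, file_path is still set to the last processed value after the loop ends.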

You can use a backreference as well (which will come in handy to reach your ultimate goal of extracting the filename), e.g.

sed 's/^\([^,][^,]*\).*$/\1/'

(which says: ^ anchors the match at the beginning of the line; \([^,][^,]*\) captures one or more characters that are not commas; .*$ matches the rest of the line; and \1 replaces the whole match with only the captured text, using a back-reference)
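For comparison, here is a sketch of the same capture written with extended regular expressions. It assumes a sed that accepts -E (GNU and BSD sed both do); the variable name before_comma is only illustrative.

```shell
name='ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif, band 1: Failed to compute statistics, no valid pixels found in sampling.'

# With -E the group is written ( ... ) without backslashes,
# and + replaces the [^,][^,]* idiom for "one or more non-commas".
before_comma=$(printf '%s\n' "$name" | sed -E 's/^([^,]+).*$/\1/')
printf '%s\n' "$before_comma"
```

The basic and extended forms match the same text here; the extended form is mainly easier to read.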

To reach your goal of extracting only the filename, you need only modify the above to begin the capture with the first forward slash, e.g.

sed 's/^[^/]*\([^,][^,]*\).*$/\1/'

Example Use/Output

$ sed 's/^[^/]*\([^,][^,]*\).*$/\1/' <<< "$name"
/home/rphillips/Desktop/empties/BN23_2303.tif

I'm not sure if this is causing the expression not to be recognized by sed -e?

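sed uses Basic Regular Expressions by default and Extended Regular Expressions with the -E (--regexp-extended) option. Neither flavor includes look-ahead or look-behind, which are PCRE features, so sed cannot interpret (?=,) as a lookahead.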
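In POSIX-style regular expressions, the usual substitute for "lazy match up to a lookahead" is a negated character class. The sketch below uses grep -o, as mentioned in the comments; -o is available in GNU and BSD grep but is not required by POSIX, so treat its availability as an assumption.

```shell
name='ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif, band 1: Failed to compute statistics, no valid pixels found in sampling.'

# [^,]* matches as many non-comma characters as possible, so it stops
# just before the first comma, which is what .+?(?=,) was meant to do.
before_comma=$(printf '%s\n' "$name" | grep -o '^[^,]*')
printf '%s\n' "$before_comma"
```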

If you plan on using the remaining fields of the comma-separated-values, you may want to consider awk to parse the fields. You can easily obtain all fields specifying the -F field separator and a simple loop.

$ awk -F', ' '{for (i = 1; i <= NF; i++) printf "field %d - %s\n", i, $i}' <<< "$name"
field 1 - ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif
field 2 - band 1: Failed to compute statistics
field 3 - no valid pixels found in sampling.

(you can handle further parsing of each field with a conditional within the loop as well)
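One way such a conditional might look is sketched here: it prints the path portion of any field that contains a slash and skips the rest. The choice of "contains a slash" as the test is an assumption about what marks a path in this log format.

```shell
name='ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif, band 1: Failed to compute statistics, no valid pixels found in sampling.'

file_path=$(printf '%s\n' "$name" | awk -F', ' '{
    for (i = 1; i <= NF; i++) {
        p = index($i, "/")        # position of the first slash, 0 if none
        if (p > 0)                # only fields that contain a path
            print substr($i, p)   # text from the first slash onward
    }
}')
printf '%s\n' "$file_path"
```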

In Bash - Parameter Expansions Are All You Need

Not to lose sight of the forest for the trees, since you specified bash, if you simply want to extract the filename from name, all you need is parameter expansion with substring removal (first from the right, and then left), e.g.

tmp=${name%%,*}    ## remove everything from the first comma to the end
echo "/${tmp#*/}"  ## remove everything up to and including the first /, then restore the leading /
/home/rphillips/Desktop/empties/BN23_2303.tif

(a much more efficient way to go)
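The same family of expansions can also split the path further. The following is a sketch with illustrative variable names (file_path, base_name, dir_name); it uses only POSIX parameter expansions, so it should also run in shells other than bash.

```shell
name='ERROR 1: /home/rphillips/Desktop/empties/BN23_2303.tif, band 1: Failed to compute statistics, no valid pixels found in sampling.'

tmp=${name%%,*}             # longest suffix starting at a comma removed
file_path="/${tmp#*/}"      # shortest prefix ending in / removed, slash restored
base_name=${file_path##*/}  # longest prefix ending in / removed: file name only
dir_name=${file_path%/*}    # shortest suffix starting at / removed: directory only

printf '%s\n' "$file_path" "$base_name" "$dir_name"
```

No external process is started for any of these steps, which is where the efficiency gain over sed or awk comes from inside a loop.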

Look things over and let me know if you have further questions.

David C. Rankin