Linux extract text between specific strings

Question

I have multiple files with different job names. The job name is specified as follows.

#SBATCH --job-name=01_job1 #Set the job name

I want to use sed/awk/grep to automatically get the name, that is to say, what follows '--job-name=' and precedes the comment '#Set the job name'. For the example above, I want to get 01_job1. The job name could be longer for several files, and there are multiple = signs in following lines in the file.

I have tried using grep -oP "job-name=\s+\K\w+" file and get an empty output. I suspect that this doesn't work because there is no space between 'name=' and '01_job1', so they must be understood as a single word.

I also unsuccessfully tried using awk '{for (I=1;I<NF;I++) if ($I == "name=") print $(I+1)}' file, attempting to find the characters after 'name='.

Lastly, I also unsuccessfully tried sed -e 's/name=\(.*\)#Set/\1/' file to find the characters between 'name=' and the beginning of the comment '#Set'. I receive the whole file as my output when I attempt this.

I appreciate any guidance. Thank you!!

Gilles Quénot · Answer 1 · 2023-01-06T00:12:34.493

2

You were close. Here is a corrected version of your grep -oP attempt (the main issue is that your pattern requires whitespace after the = character, and there is none):

$ grep -oP -- '--job-name=\K\S+' file
01_job1

The regular expression matches as follows:

Node	Explanation
`--job-name=`	the literal string '--job-name='
`\K`	resets the start of the match (what is `K`ept), a shorter alternative to a look-behind assertion (see: perlmonks look arounds and Support of K in regex)
`\S+`	one or more non-whitespace characters (anything but \n, \r, \t, \f, and " "), matching as many as possible
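
As a rough illustration of how this could be applied to several job scripts at once, the sketch below creates two throwaway files in a temporary directory and prints the job name found in each. The file names and their contents are made up for the example, and GNU grep is assumed because -P is not available in other grep implementations.

```shell
#!/bin/bash
# Hypothetical sample files, created only for this demonstration.
tmpdir=$(mktemp -d)
printf '%s\n' '#!/bin/bash' '#SBATCH --job-name=01_job1 #Set the job name' '#SBATCH --ntasks=4' > "$tmpdir/a.sh"
printf '%s\n' '#!/bin/bash' '#SBATCH --job-name=02_longer_name #Set the job name' > "$tmpdir/b.sh"

# -h suppresses the file-name prefix that grep adds when it is given several files.
names=$(grep -ohP -- '--job-name=\K\S+' "$tmpdir"/*.sh)
printf '%s\n' "$names"

rm -r "$tmpdir"
```

Dropping -h would prefix each result with the file it came from, which may be more useful when the names need to be mapped back to their scripts.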

edited Jan 06 '23 at 00:12

answered Jan 05 '23 at 14:08

Gilles Quénot



For some reason I have to escape the dashes in the macOS Terminal: `ggrep -oP '\--job-name=\K\S+'`. ggrep is just GNU grep. – Supertech Jan 05 '23 at 14:46

No, check my edited post, which uses `--` (_end of parameters_) – Gilles Quénot Jan 05 '23 at 14:50
What is the meaning of `"--"`? – Supertech Jan 05 '23 at 14:52

Check edited post again – Gilles Quénot Jan 05 '23 at 14:56

That explains why I had to escape the pattern without the double dash. Everything makes sense now. – Supertech Jan 05 '23 at 15:01
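
To make the point of this comment thread concrete, here is a small sketch. The sample line is taken from the question, and the variable names are only for illustration. Without `--`, grep parses a pattern that begins with `--` as a long option and fails. With `--`, or with the pattern passed through `-e`, it is treated as a pattern. GNU grep is assumed because of -P.

```shell
#!/bin/bash
line='#SBATCH --job-name=01_job1 #Set the job name'

# The pattern starts with "--", so grep tries to parse it as an option and errors out.
if ! printf '%s\n' "$line" | grep -oP '--job-name=\K\S+' 2>/dev/null; then
    echo "without --: grep rejected the pattern as an unknown option"
fi

# "--" marks the end of options; everything after it is a pattern or a file name.
with_dashes=$(printf '%s\n' "$line" | grep -oP -- '--job-name=\K\S+')

# Passing the pattern via -e is another way to keep it from being read as an option.
with_e=$(printf '%s\n' "$line" | grep -oP -e '--job-name=\K\S+')

echo "$with_dashes $with_e"
```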

score 2 · Answer 2 · answered Jan 05 '23 at 14:12

You need to match the whole line with sed, capture just the part you want, and use the -n option together with the p flag:

sed -n 's/.*name=\([^[:space:]]*\).*/\1/p'

See the online demo:

#!/bin/bash
s='#SBATCH --job-name=01_job1           #Set the job name'
sed -n 's/.*name=\([^[:space:]]*\).*/\1/p' <<< "$s"
# => 01_job1

Details:

-n - suppresses default line output
.* - any text
name= - a literal name= string
\([^[:space:]]*\) - Group 1 (\1): any zero or more chars other than whitespace
.* - any text
p - print the result of the successful substitution.
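
The role of -n and p is easiest to see side by side. The sketch below is only an illustration: the three-line job script is made up, and the variable names are arbitrary. Without -n, sed prints every input line whether or not the substitution matched, which is probably why the attempt in the question printed the whole file.

```shell
#!/bin/bash
# Hypothetical three-line job script used only for this demonstration.
script=$(printf '%s\n' '#!/bin/bash' '#SBATCH --job-name=01_job1 #Set the job name' '#SBATCH --ntasks=4')

# Without -n and p: every line is printed, substituted or not.
all_lines=$(printf '%s\n' "$script" | sed 's/.*name=\([^[:space:]]*\).*/\1/')

# With -n and p: only lines where the substitution succeeded are printed.
job_name=$(printf '%s\n' "$script" | sed -n 's/.*name=\([^[:space:]]*\).*/\1/p')

printf '%s\n' "$all_lines" | wc -l   # counts all three input lines
echo "$job_name"                     # 01_job1
```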

With `awk`, you [could probably use](https://ideone.com/qftmp6) `awk '$2 ~ /^--job-name=/ {gsub(/.*=/, "", $2); print $2}'`, but that depends on assumptions already. — Wiktor Stribiżew, Jan 05 '23 at 14:16

score 2 · Answer 3 · answered Jan 05 '23 at 14:16


Similar to the answer by Gilles Quénot:

grep -oP -- '--job-name=\K.*?(?= *# *Set the job name)'

This adds a look-ahead to ensure that the string is followed by #Set the job name. The non-greedy .*? keeps the blanks in front of the comment out of the match.
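
A quick way to see what the look-ahead buys you is to run the same pattern against a line with the comment and a line without it. This is a sketch under a few assumptions: it uses `\S+` for the name, which presumes job names contain no whitespace, it allows any whitespace around the `#`, and it needs GNU grep for -P. The variable names are illustrative.

```shell
#!/bin/bash
with_comment='#SBATCH --job-name=01_job1           #Set the job name'
without_comment='#SBATCH --job-name=01_job1'

# \S+ stops at the first whitespace; the look-ahead then requires the comment to follow.
pattern='--job-name=\K\S+(?=\s*#\s*Set the job name)'

matched=$(printf '%s\n' "$with_comment" | grep -oP -- "$pattern")
# grep exits non-zero when nothing matches, so "|| true" keeps the script going.
unmatched=$(printf '%s\n' "$without_comment" | grep -oP -- "$pattern" || true)

echo "with comment:    '$matched'"
echo "without comment: '$unmatched'"
```

The second line produces no match, so the look-ahead variant is only appropriate when every job script is known to carry that exact comment.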

answered Jan 05 '23 at 14:16

kvantour


score 2 · Answer 4 · answered Jan 05 '23 at 14:37

1st solution: With GNU awk and your shown samples, please try the following code.

awk -v RS=' --job-name=\\S+' 'RT && split(RT,arr,"="){print arr[2]}' Input_file

Or, written across several lines, the same GNU awk code would be:

awk -v RS=' --job-name=\\S+' '
RT && split(RT,arr,"="){
   print arr[2]
}
' Input_file

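2nd solution: Using any awk, please try the following code. The /--job-name=/ condition restricts the output to lines that actually contain a job name.

awk -F'[[:space:]]+|--job-name=' '/--job-name=/{print $3}' Input_file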

3rd solution: Using GNU grep with your shown samples, please try the following code, which uses the non-greedy .*? quantifier in the regex.

grep -oP '^.*?--job-name=\K\S+' Input_file
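
If neither GNU awk nor GNU grep is available, one more option is awk's match() function, which is part of POSIX awk. The following is a minimal sketch rather than part of the answer above: the sample line comes from the question, the variable name is illustrative, and `[^ \t]` is used in place of a character class on the assumption that job names contain no spaces or tabs.

```shell
#!/bin/bash
line='#SBATCH --job-name=01_job1 #Set the job name'

job_name=$(printf '%s\n' "$line" | awk '
    # match() sets RSTART and RLENGTH for the first "--job-name=<non-blanks>" on the line.
    match($0, /--job-name=[^ \t]+/) {
        # Skip the 11 characters of "--job-name=" and keep the rest of the match.
        print substr($0, RSTART + 11, RLENGTH - 11)
    }')
echo "$job_name"
```

Because match() is used as the pattern, lines without a job name are skipped and print nothing.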

dawg · Answer 5 · 2023-01-05T16:04:47.397

You can use a lookbehind and lookahead with GNU grep to get exactly what you describe:

grep -oP '(?<=--job-name=)\S+(?=\s+#Set the job name)' file

Or with awk:

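awk '/^#SBATCH[[:space:]]+--job-name=/ &&
     /#Set the job name$/ {
        sub(/^[^=]*=/,"")               # drop everything up to and including the first =
        sub(/[[:space:]]*#[^#]*$/,"")   # drop the trailing comment and the blanks before it
        print
     }' file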

Or perl:

perl -lnE 'say $1 if /(?<=--job-name=)(\S+)(?=\s+#Set the job name)/'   file

Each of these prints:

01_job1
