Awk inside of qsub

Question

I have a bash script in which I have a few qsubs. Each of them waits for a previous qsub to finish before starting.

My first qsub consists of sending the files in a certain directory to a perl program and having the outfiles printed in a new directory. At the end, I echo the array with all my job names. This script works as intended.

mkdir -p /perl_files_dir
for ID_FILES in `ls Infiles_dir/*.txt`;
do
JOB_ID=`echo "perl perl_scirpt.pl $ID_FILES" | qsub -j oe `
JOB_ID_ARRAY="${JOB_ID_ARRAY}:$JOB_ID" 
done
echo $JOB_ID_ARRAY

My second qsub is meant to sort all the files produced by my perl script into a new outfile, and to start only after all these jobs (about 100) are done, using depend=afterany. Again, this part is working fine.

SORT_JOB=`echo "sort -m -n perl_files_dir/*.txt  >>sorted_file.txt" | qsub -j oe -W depend=afterany$JOB_ID_ARRAY`
SORT_ARRAY="${SORT_ARRAY}:$SORT_JOB"

My issue is that in my sorted file, I have a few columns I wish to remove (2 to 6), so I came up with this last line using awk piped to sed with another depend=afterany

SED=`echo "awk '{\$2="";\$3="";\$4="";\$5="";\$6=""; print \$0}' sorted_file.txt \
| sed 's/     //g' >final_file.txt" | qsub -j oe -W depend=afterany$SORT_ARRAY`

This last step creates final_file.txt, but leaves it empty. I added SED= before my echo because it would otherwise give me Command not found.

I tried without the pipe so it would just print everything. Unfortunately it prints nothing. I assume it is not opening my sorted file and this is why my final file is empty after my sed. If it's the case, then why won't awk read it?

In my script, I am using variables to define my directories and files (with the correct path). I know my issue is not about finding my files or directories, since they are perfectly defined at the beginning and used throughout the script. I tried to write the whole path instead of a variable and I get the same results.

score 0 · Answer 1 · answered Aug 06 '13 at 07:25

for ID_FILES in `ls Infiles_dir/*.txt`

Simplify this to

for ID_FILES in Infiles_dir/*.txt

ls lists the files you pass it (except when you pass it directories, then it lists their content). Rather than telling it to display a list of files and parse the output, use the list of files you already have! This is more reliable (parsing the output of ls will fail if the file names contain whitespace or wildcard characters), clearer and faster. Don't parse the output of ls.
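
To make the difference concrete, here is a minimal sketch that loops over the same files both ways. The scratch directory and the file names are made up for the demonstration; they are not from the question.

```shell
# Hypothetical scratch directory containing one awkward file name.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.txt" "$demo_dir/with space.txt"

# Parsing ls: the output is split on whitespace, so the name
# containing a space arrives as two separate words.
ls_count=0
for f in $(ls "$demo_dir"/*.txt); do
    ls_count=$((ls_count + 1))
done

# Plain glob: each matching file is exactly one word.
glob_count=0
for f in "$demo_dir"/*.txt; do
    glob_count=$((glob_count + 1))
done

echo "ls loop saw $ls_count items, glob loop saw $glob_count files"
rm -r "$demo_dir"
```

With two files on disk, the ls loop runs three times and the glob loop runs twice.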

SORT_JOB=`echo "sort -m -n perl_files_dir/*.txt  >>sorted_file.txt" | qsub -j oe -W depend=afterany$JOB_ID_ARRAY`

You'd make your life simpler if you used the right form of quoting in the right place. Don't use backquotes, because it's difficult to know how to quote things inside them. Use $(…) instead: it's exactly equivalent except that it is parsed in a sane way.
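
One place where the difference shows is nesting. The sketch below runs the same nested substitution in both styles; the path is only an illustrative string and does not need to exist.

```shell
# $(...) nests without any extra escaping.
new_style=$(basename "$(dirname /usr/local/bin/tool)")

# Backquotes need the inner pair escaped with backslashes.
old_style=`basename \`dirname /usr/local/bin/tool\``

echo "dollar-paren: $new_style, backquotes: $old_style"
```

Both variables end up holding bin, but only the first form stays readable as the nesting gets deeper.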

I recommend using a here document for the shell snippet that you're feeding to qsub. You have fewer quoting issues to worry about, and it's more readable.

While we're at it, always put double quotes around variable substitutions and command substitutions: "$some_variable", "$(some_command)". Annoyingly, $var in shell syntax doesn't mean “take the value of the variable var”, it means “take the value of the variable var, parse it as a list of wildcard patterns, and replace each pattern by the list of matching files if there are matching files”. This extra stuff is turned off if the substitution happens inside double quotes (or in a here document, by the way): "$var" means “take the value of the variable var”.
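
A small sketch of that extra processing, using a made-up scratch directory and counting how many words each form produces:

```shell
# Hypothetical scratch directory with two matching files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.txt" "$demo_dir/b.txt"

pattern="$demo_dir/*.txt"

# Unquoted: the value is treated as a wildcard pattern and replaced
# by the list of matching files.
set -- $pattern
unquoted_count=$#

# Quoted: the value is used as is, as a single word.
set -- "$pattern"
quoted_count=$#

echo "unquoted: $unquoted_count words, quoted: $quoted_count word"
rm -r "$demo_dir"
```

The unquoted expansion yields two words (the two file names), the quoted one yields a single word that still contains the literal asterisk.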

SORT_JOB=$(qsub -j oe -W depend="afterany$JOB_ID_ARRAY" <<'EOF'
sort -m -n perl_files_dir/*.txt  >>sorted_file.txt
EOF
)
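
The quotes around the delimiter ('EOF') matter: they tell the shell to pass the body through untouched. Here is a sketch of the difference, with cat standing in for qsub (an assumption made so the example runs without a batch system; both simply read the script from standard input).

```shell
name=expanded

# Quoted delimiter: nothing in the body is expanded by this shell.
quoted=$(cat <<'EOF'
echo "$name" *.txt
EOF
)

# Unquoted delimiter: $name is expanded before cat ever sees the text.
unquoted=$(cat <<EOF
echo "$name"
EOF
)

printf '%s\n' "$quoted" "$unquoted"
```

The first variable holds the line exactly as written, dollar sign and asterisk included; the second holds echo "expanded".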

We now get to the snippet where the quoting was actually causing a problem.

SED=`echo "awk '{\$2="";\$3="";\$4="";\$5="";\$6=""; print \$0}' sorted_file.txt \
| sed 's/     //g' >final_file.txt" | qsub -j oe -W depend=afterany$SORT_ARRAY`

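Inside backquotes, \$ is reduced to $ before the command is parsed, so $2 to $6 and $0 end up being expanded by the shell as the script's own positional parameters. Each "" merely closes and reopens the double-quoted string and contributes nothing. Assuming the script was started without arguments, the string that becomes the argument to the echo command is:

awk '{=;=;=;=;=; print SCRIPT}' sorted_file.txt | sed 's/     //g' >final_file.txt

where SCRIPT stands for whatever $0 expands to, typically the name of the script. This is not a valid awk program, and that's why you're not getting any output.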
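
You can watch the mangling happen on a shortened, two-column version of the command. This is a sketch under two assumptions: the script has no arguments (so $2 and $3 are empty), and sh -c stands in for the batch job that would run the string. Note that inside backquotes the shell turns \$ into $ before parsing the double-quoted string, so the field references are expanded too.

```shell
# Pretend the script was started without arguments.
set --

# Same quoting as in the question, shortened to two columns.
# Each "" closes and reopens the double-quoted string.
received=`echo "awk '{\$2="";\$3=""; print}'"`

echo "$received"

# Run the string the way the job would, feeding it one line of input.
if printf 'a b c\n' | sh -c "$received" >/dev/null 2>&1; then
    awk_status=accepted
else
    awk_status=rejected
fi
echo "awk $awk_status the program"
```

The string that comes out is awk '{=;=; print}', which awk refuses with a syntax error.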

You didn't escape the double quotes inside what was meant to be the awk snippet. It's a lot clearer if you use a here document. Also, you don't need the SED= part. You added it because you had a command substitution (a command between backquotes, `…`), which substitutes the output of a command. Without the assignment, the shell tried to run that output (whatever qsub printed) as a command, hence “Command not found”. Since you aren't interested in the output of the qsub command, don't capture it, just execute qsub directly.

qsub -j oe -W depend="afterany$SORT_ARRAY" <<'EOF'
awk '{$2="";$3="";$4="";$5="";$6=""; print $0}' sorted_file.txt |
sed 's/     //g' >final_file.txt
EOF
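
Before submitting, the pipeline itself can be checked locally. The sketch below feeds it two made-up lines of eight space-separated columns in place of sorted_file.txt.

```shell
# Hypothetical sample standing in for sorted_file.txt.
sample='1 b c d e f g h
2 i j k l m n o'

# Blank columns 2 to 6, then remove the run of five spaces left behind.
result=$(printf '%s\n' "$sample" |
    awk '{$2="";$3="";$4="";$5="";$6=""; print $0}' |
    sed 's/     //g')

printf '%s\n' "$result"
```

Emptying five fields leaves six consecutive separators; sed strips five of them, so the first sample line comes out as 1 g h and the second as 2 n o.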

I'm not familiar with qsub, but presumably there's a way to get the error output and the return status of the commands it runs. Inspect that error output: it should contain the errors from awk.

score -1 · Answer 2 · edited May 23 '17 at 12:05

The version of awk that I am using does not like the character escapes:

awk --version
GNU Awk 3.1.7

spuder@cent64$ awk '{\$2="";\$3="";\$4=""; print \$0}' foo.txt 
awk: {\$2="";\$3="";\$4=""; print \$0}
awk:  ^ backslash not last character on line

Try the following syntax, which blanks columns 2 to 6 in a loop:

awk '{for(i=2;i<=6;i++) $i="";print}' foo.txt
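
A quick way to try the loop form on columns 2 to 6, as asked in the question. The sample line is made up, and tr -s is swapped in here (instead of the question's sed) to squeeze the leftover separators.

```shell
# Hypothetical sample line with eight space-separated columns.
sample='1 b c d e f g h'

# Blank columns 2 to 6 in a loop, then squeeze repeated spaces into one.
result=$(printf '%s\n' "$sample" |
    awk '{for (i = 2; i <= 6; i++) $i = ""; print}' |
    tr -s ' ')

printf '%s\n' "$result"
```

This prints 1 g h. Note that tr -s squeezes every run of spaces on the line, which is fine for this sample but would also collapse spacing you might want to keep in other data.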

As a side note, if you are using Torque 4.x, you may not be able to use a comma-separated list of jobs with -W depend=. Instead, you may need to add a separate PBS directive (-W) for each job.

For example:

#Invalid syntax in newer versions of torque 
qsub -W depend=foo,bar

Resources

backslash in gawk fields
Print all but the first three columns
http://docs.adaptivecomputing.com/torque/help.htm#topics/commands/qsub.htm#-W

Awk isn't seeing those backslashes. The problem is actually that it isn't seeing the `""` either. — Gilles 'SO- stop being evil', Aug 06 '13 at 07:26
