I have a bunch of files using the format file.1.a.1.txt that look like this:

A 1
B 2
C 3
D 4

and was using the following command to add a new column containing the name of each file:

awk '{print FILENAME (NF?"\t":"") $0}' file.1.a.1.txt > file.1.a.1.txt

which made them look the way I want:

file.1.a.1.txt A 1
file.1.a.1.txt B 2
file.1.a.1.txt C 3
file.1.a.1.txt D 4

However, I need to do this for many files as a job on an HPC cluster submitted with sbatch. When I run the following job script:

#!/bin/bash
#<other SBATCH info>
#SBATCH --array=1-10

N=$SLURM_ARRAY_TASK_ID

for j in {a,b,c};
do
    for i in {1,2,3}
    do awk '{print FILENAME (NF?"\t":"") $0}' file.${N}."$j"."$i".txt > file.${N}."$j"."$i".txt
    done
done

awk generates empty files. I have tried using cat to read each file and pipe it to awk, but that hasn't worked either.

Geode
  • Change `file.1.a.1.txt > file.1.a.1.txt` to `file.1.a.1.txt > temp && mv -f temp file.1.a.1.txt` -- you cannot redirect to the file being processed. – David C. Rankin Jun 11 '20 at 17:34
  • Welcome to SO, and kudos for a nice post (which shows effort in the form of code plus a sample of the input); keep it up. Could you let us know whether you need to save the output into the input file itself? Also, are all your file extensions `.txt`? – RavinderSingh13 Jun 11 '20 at 17:34
  • 1
    @RavinderSingh13 good comment, if all files can be identified by some glob, then there is no need for a loop -- which when you get the response would make a nice answer while also improving the efficiency of the task by 1000%+ – David C. Rankin Jun 11 '20 at 17:39
  • Replace `> file.1.a.1.txt` with `| sponge file.1.a.1.txt` if `sponge` (from moreutils) is available. – Cyrus Jun 11 '20 at 17:47
  • Thanks, all file extensions are `.txt`. David's solution worked, though I am still not sure why the original formatting worked in the standalone and not in the job/loop. – Geode Jun 11 '20 at 17:50
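
For reference, a minimal sketch of David C. Rankin's temp-file suggestion applied to the job script from the question might look like this (the SBATCH lines and filename pattern are the ones from the question; the `.tmp` suffix and the commented-out `sponge` alternative from Cyrus's comment are only illustrative, and `sponge` requires moreutils):

#!/bin/bash
#<other SBATCH info>
#SBATCH --array=1-10

N=$SLURM_ARRAY_TASK_ID

for j in a b c; do
    for i in 1 2 3; do
        f="file.${N}.${j}.${i}.txt"
        # write to a temporary file first, then replace the original
        awk '{print FILENAME (NF?"\t":"") $0}' "$f" > "${f}.tmp" && mv -f "${f}.tmp" "$f"
        # or, with moreutils installed:
        # awk '{print FILENAME (NF?"\t":"") $0}' "$f" | sponge "$f"
    done
done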

1 Answer

You don't need a loop, and you cannot redirect stdout to the same file you are reading as input: the shell truncates the output file before awk ever opens it, which is why you get blank files.

Try this:

#!/bin/bash

N=$SLURM_ARRAY_TASK_ID

awk '
   NF{
      print FILENAME "\t" $0 > FILENAME".tmp"
   }
   ENDFILE{ # requires gawk
      close(FILENAME".tmp") 
   }' file."$N".{a,b,c}.{1,2,3}.txt

for file in file*.tmp; do
   mv "$file" "${file%.tmp}"
done
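
Since the script still reads $SLURM_ARRAY_TASK_ID, it presumably still has to be submitted as the same job array as the original script; for example (job.sh is just a hypothetical name for the script above, and the --array range is the one from the question):

sbatch --array=1-10 job.sh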

Note that if you don't have GNU awk (gawk) and therefore can't use ENDFILE{}, you can remove that stanza and get away with either of the following (a portable middle ground is also sketched after this list):

  1. Putting the close() call just after the print statement (this adds a lot of per-line overhead; also, many awks re-truncate a file that is re-opened with `>` after a close(), so you would want `>>` there to be safe), or
  2. Not calling close() at all; as long as you aren't processing a huge number of files, you should be fine.
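
As a portable middle ground (this is not from the answer above, just an illustrative sketch), you can close each temp file whenever awk moves on to the next input file, so there is only one close() per file rather than per line, and no gawk-only ENDFILE{} is needed:

awk '
   FNR == 1 && prev != "" {   # starting a new input file: close the previous temp file
      close(prev ".tmp")
   }
   NF {
      print FILENAME "\t" $0 > (FILENAME ".tmp")
   }
   { prev = FILENAME }
' file."$N".{a,b,c}.{1,2,3}.txt

The .tmp files are then renamed with the same for loop shown above.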
SiegeX
  • This also worked for me, along with David R's suggestion of using `temp && mv -f temp`. I have ~1000 files with ~5M rows and only 2 columns; which solution would be best regarding runtime? – Geode Jun 11 '20 at 18:05
  • 1
    Well, you can find out by calling both scripts with the builtin `time` command in front of it. As in `time /path/to/SiegeX/version.sh` and `time /path/to/your/modified/version.sh`. I will say that I would be ***very*** surprised if mine wasn't faster because my version calls [tag:awk] ***one*** time, where as your version calls [tag:awk] once for ***each*** file it operates on. – SiegeX Jun 11 '20 at 19:15
  • P.S., if you update your question requirements to do what you want without a loop to maximize efficiency, this question will likely be re-opened for more answers. – SiegeX Jun 11 '20 at 19:20