I have a bunch of files using the format file.1.a.1.txt that look like this:

A 1
B 2
C 3
D 4

and was using the following command to add a new column containing the name of each file:

awk '{print FILENAME (NF?"\t":"") $0}' file.1.a.1.txt > file.1.a.1.txt

which made them look the way I want:

file.1.a.1.txt A 1
file.1.a.1.txt B 2
file.1.a.1.txt C 3
file.1.a.1.txt D 4

However, I need to do this for many files as a job on an HPC cluster submitted with sbatch. When I run the following job script:

#!/bin/bash
#<other SBATCH info>
#SBATCH --array=1-10

N=$SLURM_ARRAY_TASK_ID

for j in {a,b,c};
do
    for i in {1,2,3}
    do awk '{print FILENAME (NF?"\t":"") $0}' file.${N}."$j"."$i".txt > file.${N}."$j"."$i".txt
    done
done

awk generates empty files. I have tried using cat to read each file and pipe it to awk, but that hasn't worked either.

Geode
  • Change `file.1.a.1.txt > file.1.a.1.txt` to `file.1.a.1.txt > temp && mv -f temp file.1.a.1.txt` -- you cannot redirect to the file being processed. – David C. Rankin Jun 11 '20 at 17:34
  • Welcome to SO, and kudos for a nice post (which shows effort in the form of code plus a sample of the input); keep it up. Could you let us know whether you need to save the output into the input file itself? Also, are all your file extensions `.txt`? – RavinderSingh13 Jun 11 '20 at 17:34
  • 1
    @RavinderSingh13 good comment, if all files can be identified by some glob, then there is no need for a loop -- which when you get the response would make a nice answer while also improving the efficiency of the task by 1000%+ – David C. Rankin Jun 11 '20 at 17:39
  • Replace `> file.1.a.1.txt` with `| sponge file.1.a.1.txt` if `sponge` (from moreutils) is available. – Cyrus Jun 11 '20 at 17:47
  • Thanks, all file extensions are `.txt`. David's solution worked, though I am still not sure why the original formatting worked in the standalone and not in the job/loop. – Geode Jun 11 '20 at 17:50
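
For reference, a minimal sketch of David C. Rankin's temp-file suggestion applied to the job script from the question might look like this (the SBATCH lines and filename pattern are the ones from the question; the `.tmp` suffix and the commented-out `sponge` alternative from Cyrus's comment are only illustrative, and `sponge` requires moreutils):

#!/bin/bash
#<other SBATCH info>
#SBATCH --array=1-10

N=$SLURM_ARRAY_TASK_ID

for j in a b c; do
    for i in 1 2 3; do
        f="file.${N}.${j}.${i}.txt"
        # write to a temporary file first, then replace the original
        awk '{print FILENAME (NF?"\t":"") $0}' "$f" > "${f}.tmp" && mv -f "${f}.tmp" "$f"
        # or, with moreutils installed:
        # awk '{print FILENAME (NF?"\t":"") $0}' "$f" | sponge "$f"
    done
done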

1 Answer

You don't need a loop, and you cannot redirect stdout to the same file you are reading as input: the shell truncates the output file before awk ever opens it, which is why you get blank files.

Try this:

#!/bin/bash

N=$SLURM_ARRAY_TASK_ID

awk '
   NF{
      print FILENAME "\t" $0 > FILENAME".tmp"
   }
   ENDFILE{ # requires gawk
      close(FILENAME".tmp") 
   }' file."$N".{a,b,c}.{1,2,3}.txt

for file in file*.tmp; do
   mv "$file" "${file%.tmp}"
done
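
Since the script still reads $SLURM_ARRAY_TASK_ID, it presumably still has to be submitted as the same job array as the original script; for example (job.sh is just a hypothetical name for the script above, and the --array range is the one from the question):

sbatch --array=1-10 job.sh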

Note that if you don't have GNU awk (gawk) and therefore can't use ENDFILE{}, you can remove that stanza and get away with either of the following (a portable middle ground is also sketched after this list):

  1. Putting the close() call just after the print statement (this adds a lot of per-line overhead; also, many awks re-truncate a file that is re-opened with `>` after a close(), so you would want `>>` there to be safe), or
  2. Not calling close() at all; as long as you aren't processing a huge number of files, you should be fine.
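
As a portable middle ground (this is not from the answer above, just an illustrative sketch), you can close each temp file whenever awk moves on to the next input file, so there is only one close() per file rather than per line, and no gawk-only ENDFILE{} is needed:

awk '
   FNR == 1 && prev != "" {   # starting a new input file: close the previous temp file
      close(prev ".tmp")
   }
   NF {
      print FILENAME "\t" $0 > (FILENAME ".tmp")
   }
   { prev = FILENAME }
' file."$N".{a,b,c}.{1,2,3}.txt

The .tmp files are then renamed with the same for loop shown above.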
SiegeX
  • This also worked for me, along with David R's suggestion of using `temp && mv -f temp`. I have ~1000 files with ~5M rows and only 2 columns; which solution would be best regarding runtime? – Geode Jun 11 '20 at 18:05
  • 1
    Well, you can find out by calling both scripts with the builtin `time` command in front of it. As in `time /path/to/SiegeX/version.sh` and `time /path/to/your/modified/version.sh`. I will say that I would be ***very*** surprised if mine wasn't faster because my version calls [tag:awk] ***one*** time, where as your version calls [tag:awk] once for ***each*** file it operates on. – SiegeX Jun 11 '20 at 19:15
  • P.S., if you update your question requirements to do what you want without a loop to maximize efficiency, this question will likely be re-opened for more answers. – SiegeX Jun 11 '20 at 19:20