I am running a script to merge 14 files by column. The script looks like this:

cp $file_1 ${out_file}
for i in {2..14};do
    paste -d ',' <(cat ${out_file}) <(cut -d ',' -f 2 ${file}_${i}) > ${out_file}
done

However, strangely, this script always fails on the first run (the row names are missing) and succeeds on the second run. Any idea what is going on here?

Surely, I can add a copy command to avoid using ${out_file} twice, something like:

cp $file_1 ${out_file}
for i in {2..14};do
    paste -d ',' <(cat ${out_file}) <(cut -d ',' -f 2 ${file}_${i}) > ${out_file}.new
    cat ${out_file}.new > ${out_file}  #This line
done

But is there a better way to do this without copying files in every iteration?

======Here are some details======

file #1:

Name,Simon
Age,30
Sex,Male
Weight,150

file #2:

Name,Mary
Age,27
Sex,Female
Weight,120

I want to merge them to:

Name,Simon,Mary
Age,30,27
Sex,Male,Female
Weight,150,120

If I only run the script once, I will get:

,Simon,Mary
,30,27
,Male,Female
,150,120

After I re-run the script, I will get the correct output:

Name,Simon,Mary
Age,30,27
Sex,Male,Female
Weight,150,120
SimonInNYC
  • 2
    `>` you overwrite the output file each loop, is that intended? `... <(cat ${out_file}) .... > ${out_file}` - you read and write to the same file at the same time, I think your out_file should always be empty. – KamilCuk Aug 05 '19 at 22:04
  • try `...<(cat file) ....> file.new`. Good luck. – shellter Aug 06 '19 at 01:06
  • @KamilCuk, 1. The intention is to merge multiple files; the information from each additional file will be added at the end of each row. 2. The file is not empty. The only issue is that the script (sometimes) won't pick up the first element in each row. – SimonInNYC Aug 06 '19 at 13:52
  • @shellter, thanks for your suggestion. Doing it the way you suggested, I guess I will have to copy **file.new** in every iteration to be able to merge everything together. Is there a better way to do this? – SimonInNYC Aug 06 '19 at 13:58
  • 2
    1. `will be added` but `>` _doesn't add_, it truncates the original file. So you do something you don't intend to. 2. The file is not empty because paste reads the second file only. The first file is empty, because it is truncated. So you see only the right part. Exactly what you've described. 3. Tip: looks like you could use `sponge` 4. The line `cp ${out_file}.new #This line` is invalid, where do you want to copy the file? 5. Why not just `join` the files? Don't you want to `join` the files on the name column? Are all the files sorted by name using the same key? – KamilCuk Aug 06 '19 at 14:37
  • @KamilCuk: Good stuff. I would add 6. be aware that `>>` gives you the option to append records to your file. Looping with join will get crazy, i.e. `join .... f1 f2 > newF ; join .... newF f3 > newF2; join .... newF2 f4 > newF3;`. Hm, I guess I would rename `newF` after each step to something like `FinalList`, then you could skip the "versioning" of `newF`. Good luck to all. – shellter Aug 06 '19 at 15:40
  • @shellter, thanks. What would be my options if I want to `join` multiple files, besides the one you described: run a loop `join F f1 >newF; rm F; mv newF F`? – SimonInNYC Aug 06 '19 at 19:06
  • @KamilCuk. 1. If it works the way you said, why did it work the second time I ran it? 2. For point #4, I meant `cat ${out_file}.new > ${out_file}`. I thought reusing `${out_file}` in the same command line as both input and output would work... 3. `join` does seem to be a better option (I haven't tried `sponge` as it is not available), thanks! – SimonInNYC Aug 06 '19 at 19:16
  • 1
    @GTWu, what you have is a race condition. All parts of a pipeline run *at the same time*, so when you run `cat foo | cat > foo`, the second `cat` runs at the same time the first one does, and `> foo`, deleting the file's contents, happens at the same time the first `cat foo` is still starting up. *Usually*, then, the file will be deleted by `> foo` before `cat foo` is ever able to read it... but because these are things all happening at the same time, sometimes the order can be different; what you have then is a *race condition*, where behavior depends on which process is faster. – Charles Duffy Aug 06 '19 at 19:29
  • ...whereas sometimes it's *possible* to reason out why a race is more likely to be won by one process or the other in specific circumstances, it's much better to just fix your code so it doesn't have them. – Charles Duffy Aug 06 '19 at 19:31
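
Following the comments above, here is a minimal bash sketch of two ways to avoid reading and writing `${out_file}` in the same command. It is not the asker's actual script: the scratch directory, the `out_loop`, `out_single`, `inputs`, `fields` and `n` names are made up for illustration, and it assumes the inputs are named `${file}_1`, `${file}_2`, ... and that every line has exactly two comma-separated fields with no embedded commas. Option 1 keeps the loop but writes to a temporary file and renames it with `mv`, which within one directory is a rename rather than a data copy. Option 2 drops the loop over the output entirely by pasting all inputs once and keeping field 1 plus every even-numbered field.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical setup: recreate the question's two sample files in a scratch directory.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
file="$workdir/file"            # assumed naming: ${file}_1, ${file}_2, ...
n=2                             # the question has 14 files; 2 are enough for the demo

printf '%s\n' 'Name,Simon' 'Age,30' 'Sex,Male' 'Weight,150' > "${file}_1"
printf '%s\n' 'Name,Mary' 'Age,27' 'Sex,Female' 'Weight,120' > "${file}_2"

# Option 1: keep the loop, but never redirect into the file that paste is reading.
out_loop="$workdir/merged_loop.csv"
cp "${file}_1" "$out_loop"
for ((i = 2; i <= n; i++)); do
    paste -d ',' "$out_loop" <(cut -d ',' -f 2 "${file}_${i}") > "$out_loop.tmp"
    mv "$out_loop.tmp" "$out_loop"   # rename in the same directory, no data copy
done

# Option 2: a single pass. Paste every input side by side, then keep the row
# names (field 1) and the value column of each file (fields 2, 4, 6, ...).
out_single="$workdir/merged_single.csv"
inputs=()
fields=1
for ((i = 1; i <= n; i++)); do
    inputs+=("${file}_${i}")
    fields+=",$((2 * i))"
done
paste -d ',' "${inputs[@]}" | cut -d ',' -f "$fields" > "$out_single"

cat "$out_loop"
cat "$out_single"
```

With the two sample files from the question, both outputs should be the merged table the asker wants (`Name,Simon,Mary` and so on). Neither variant depends on timing, because the output file is only opened for writing by a command that does not read it.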
