
Linking back to my previous question, I found the problem not to be entirely solved. Here's the problem:

I have directories named RUN1, RUN2, and RUN3. Each directory contains some files: RUN1 has mod1_1.csv, mod1_2.csv, and mod1_3.csv; RUN2 has mod2_1.csv, mod2_2.csv, and mod2_3.csv; and so on.

The contents of mod1_1.csv file look like this:

5.71 6.66 5.52 6.90
5.78 6.69 5.55 6.98
5.77 6.63 5.73 6.91

And mod1_2.csv looks like this:

5.73 6.43 5.76 6.57
5.79 6.20 5.10 7.01
5.71 6.21 5.34 6.81

In RUN2, mod2_1.csv looks like this:

5.72 6.29 5.39 5.59
5.71 6.10 5.10 7.34
5.70 6.23 5.23 6.45

And mod2_2.csv looks like this:

5.72 6.29 5.39 5.69
5.71 6.10 5.10 7.32
5.70 6.23 5.23 6.21

My goal is to find, for each RUN* directory, the line with the smallest value in column 4, and to write that line, along with the name of the model file it came from, to a new .csv file. Right now, I have this code:

#!/bin/bash
resultfile="best_results_mlp_2.txt"
for d in $(find . -type d -name 'RUN*' | sort);
do
  find $d -type f -name 'mod*' -exec sort -k4 {} -g \; | head -1 >> "$resultfile"
done

But it doesn't always return the smallest value of column 4 (I went through the files and checked), and it doesn't include the file name that contains the smallest number. To clarify, I would like a .csv file with these contents:

5.73 6.43 5.76 6.57 mod1_2.csv
5.72 6.29 5.39 5.59 mod2_1.csv
StatsSorceress

1 Answer


If you want the smallest value across all files, you have to sort all of their contents at once. Your command currently sorts each file separately, so `head -1` only returns the smallest value from whichever file was sorted first.

Check the difference between

find "$d" -type f -name 'mod*' -exec sort -k4 -g {} + 

and

find "$d" -type f -name 'mod*' -exec sort -k4 -g {} \;
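For illustration, here is a small self-contained demo of that difference, using made-up file names and values (two files whose per-file minima differ from the global minimum):

```shell
# Hypothetical data: the global minimum (1 1 1 1) lives in the second file.
tmp=$(mktemp -d)
printf '1 1 1 9\n1 1 1 2\n' > "$tmp/mod_a.csv"
printf '1 1 1 5\n1 1 1 1\n' > "$tmp/mod_b.csv"

# '+' passes both files to a single sort invocation, so the global minimum
# comes out first.
global=$(find "$tmp" -type f -name 'mod*' -exec sort -k4 -g {} + | head -1)

# '\;' runs sort once per file; head -1 only sees the first file's minimum.
perfile=$(find "$tmp" -type f -name 'mod*' -exec sort -k4 -g {} \; | head -1)

echo "with +: $global"
echo "with ;: $perfile"
rm -rf "$tmp"
```

With `+`, `$global` is always the overall smallest line; with `\;`, the result depends on which file `find` happens to list first.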

Also, it is recommended to use `-n` instead of `-g` unless you really need general-numeric sorting. See the `--general-numeric-sort` section of `info coreutils 'sort invocation'` for details on why.

Edit: Just checked the link to your previous question, and I see now that you do need `--general-numeric-sort`.
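To illustrate the difference with a tiny made-up example: `-n` stops parsing at the `e`, while `-g` understands scientific notation, which matters if your data ever uses it:

```shell
# '1e2' is 100, but sort -n parses it as just 1 (it stops at the 'e'),
# so -n and -g disagree about which line is smallest.
n_first=$(printf '1e2\n5\n' | sort -n | head -1)
g_first=$(printf '1e2\n5\n' | sort -g | head -1)
echo "-n puts first: $n_first"   # 1e2 (treated as 1)
echo "-g puts first: $g_first"   # 5   (since 1e2 = 100)
```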

That said, here's a way to get the corresponding filename into the lines, so that you have it in the output:

find "$d" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \;|sort -k4 -g |head -1 >> "$resultfile"

Essentially, `awk` is invoked for each file separately. It prints each line of the file, appending the corresponding file name to it. All those lines are then piped to a single `sort`, and `head -1` keeps the overall smallest.
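Putting it together with your loop, a self-contained sketch might look like the following. It recreates a subset of the sample data from the question in a temporary directory so you can try it anywhere; the output file name is taken from your script:

```shell
#!/bin/bash
# Recreate part of the question's sample data in a scratch directory.
work=$(mktemp -d); cd "$work" || exit 1
mkdir RUN1 RUN2
printf '5.71 6.66 5.52 6.90\n5.78 6.69 5.55 6.98\n5.77 6.63 5.73 6.91\n' > RUN1/mod1_1.csv
printf '5.73 6.43 5.76 6.57\n5.79 6.20 5.10 7.01\n5.71 6.21 5.34 6.81\n' > RUN1/mod1_2.csv
printf '5.72 6.29 5.39 5.59\n5.71 6.10 5.10 7.34\n5.70 6.23 5.23 6.45\n' > RUN2/mod2_1.csv

resultfile="best_results_mlp_2.txt"
: > "$resultfile"   # start fresh on each run

# For each RUN* directory: tag every line with its file name, sort all
# tagged lines together by column 4, and keep the smallest.
for d in $(find . -type d -name 'RUN*' | sort); do
  find "$d" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \; \
    | sort -k4 -g | head -1 >> "$resultfile"
done
cat "$resultfile"
```

This prints one winning line per directory, each suffixed with the path of the file it came from (e.g. `./RUN1/mod1_2.csv`).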

Note: The above will print the filename with the path under which `find` found it. If you want only the file's basename, you can use the following awk command instead (the rest stays the same as above):

awk 'FNR==1{ cnt=split(FILENAME, arr, "/"); basename=arr[cnt] } { print $0, basename}'
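A quick self-contained check of that variant, using a made-up file in a temporary directory:

```shell
# The awk splits FILENAME on '/' and keeps only the last component,
# so the appended column is the bare file name, not the path.
tmp=$(mktemp -d)
mkdir "$tmp/RUN1"
printf '5.73 6.43 5.76 6.57\n' > "$tmp/RUN1/mod1_2.csv"
out=$(find "$tmp" -type f -name 'mod*' -exec awk \
  'FNR==1{ cnt=split(FILENAME, arr, "/"); basename=arr[cnt] } { print $0, basename }' {} \;)
echo "$out"
rm -rf "$tmp"
```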
dgeorgiev
  • Is there a way to get the entire path instead of just the path from this directory on? For example, if my bash file is stored in /dir1/dir2, and I want the path /dir1/dir2/RUN1/mod1_1.csv as the last column in my new .csv file. – StatsSorceress Mar 13 '17 at 18:48
  • Or a subset of the current path, like /dir2/RUN1/mod1_1.csv as the last column in my new .csv file. – StatsSorceress Mar 13 '17 at 18:54
  • `find` passes the file with the path under which it was found. So if you search in `"/path/to/$d"`, you will get `"/path/to/$d/filename.csv"`. Just make `find` search in the path you would like to get. – dgeorgiev Mar 14 '17 at 10:02
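A minimal sketch of that suggestion, again with a temporary directory and a made-up file: because `find` is given an absolute starting path, both `find`'s output and awk's `FILENAME` carry the full path, so the appended column is absolute.

```shell
# Search under an absolute path; FILENAME is then the full absolute path.
tmp=$(mktemp -d)
printf '5.71 6.66 5.52 6.90\n' > "$tmp/mod1_1.csv"
line=$(find "$tmp" -type f -name 'mod*' -exec awk '{print $0, FILENAME}' {} \;)
echo "$line"
rm -rf "$tmp"
```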