0

Can you please tell me how to Select rows with min value, including exponential, based on fourth column and group by first column in linux?

Original file

ID,y,z,p-value
1,a,b,0.22
1,a,b,5e-10
1,a,b,1.2e-10
2,c,d,0.06
2,c,d,0.003
2,c,d,3e-7
3,e,f,0.002
3,e,f,2e-8
3,e,f,1.0

The file I want is as below.

ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8

Actually this worked fine, so thanks for everybody!

tail -n +2 original_file > txt sort -t, -k 4g txt | awk -F, '!visited[$1]++' | sort -k2,2 -k3,3 >> final_file
user10345633
  • 105
  • 4
  • Welcome to SO please do add your efforts which you have put in order to solve your own problem in your question and let us know then. – RavinderSingh13 Jan 29 '20 at 03:03
  • 1
    Actually I tried as below. But this does not handle "E-8" or those exponential values. tail -n +2 original_file > txt sort -t, -k 4n txt | awk -F, '!visited[$1]++' | sort -k2,2 -k3,3 >> final_file – user10345633 Jan 29 '20 at 04:28
  • Thank you for letting know your efforts, kindly do add them in your question itself. – RavinderSingh13 Jan 29 '20 at 07:02

2 Answers2

0

You can do it fairly easily in awk just by keeping the current record with the minimum 4th field for a given 1st field. You have to handle outputting the header-row and storing the first record to begin the comparison, which you can do by operating on the first record NR==1 (or first in each file processed, FNR==1).

You can store the first minimum in an array indexed by the first field and save the initial record containing values operating on the 2nd record. Then it is just a matter of checking if the first-field is not the same as the last, if so output the minimum record for the last and keep going until you run out of records. (note: this presumes the first-fields appear in increasing order as they do in your file) Then you use the END rule to output the final record.

You can put that together as follows:

awk -F, '
    FNR==1 {print; next}
    FNR==2 {rec=$0; m[$1]=$4; next}
    {
        if ($1 in m) {
            if ($4 < m[$1]) {
                rec=$0
                m[$1]=$4
            }
        }
        else {
            print rec
            rec=$0
            m[$1]=$4
        }
    }
END {
    print rec
}' file

(where your data is in the file file)

If your first field is not in increasing order, then you will need to save the current minimum record in an array as well. (e.g. turn rec into an array indexed by the first-field holding the total record as its value). You would then delay looping over both arrays until the END rule to output the minimum record for each first-field.

Example Use/Output

You can update the filename to match the filename containing your data, and then to test, all you need to do is select-copy the awk expression and middle-mouse paste it into an xterm in the directory containing your file, e.g.

$ awk -F, '
>     FNR==1 {print; next}
>     FNR==2 {rec=$0; m[$1]=$4; next}
>     {
>         if ($1 in m) {
>             if ($4 < m[$1]) {
>                 rec=$0
>                 m[$1]=$4
>             }
>         }
>         else {
>             print rec
>             rec=$0
>             m[$1]=$4
>         }
>     }
> END {
>     print rec
> }' file
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8

Look things over and let me know if you have questions.

David C. Rankin
  • 81,885
  • 6
  • 58
  • 85
  • Thank you so much for your help. But this does not read E-8 or E-10. So this handles only the values as below... I think we need to add -g? ``` ID,y,z,p-value 1,a,b,0.22 2,c,d,0.06 2,c,d,0.003 3,e,f,0.002 3,e,f,1.0 ``` – user10345633 Jan 29 '20 at 04:23
  • I'm a bit confused when you say *"does not read E-8 or E-10"* When I run against your line I get `1,a,b,0.22 2,c,d,0.003 3,e,f,0.002` returned (as it looks like it should be??) Do you mean `"E-8"` alone instead of `1.xxE-8`? Also only one `\`` around text in comments to highlight as code `:)` – David C. Rankin Jan 29 '20 at 04:39
  • Thank you for your quick response. The original file is as below. ` ID,y,z,p-value 1,a,b,0.22 1,a,b,5e-10 1,a,b,1.2e-10 2,c,d,0.06 2,c,d,0.003 2,c,d,3e-7 3,e,f,0.002 3,e,f,2e-8 3,e,f,1.0 ` I tried the codes you taught me, but the results was as below... ` ID,y,z,p-value 1,a,b,0.22 2,c,d,0.003 3,e,f,0.002 ` I wanted to get the results as below. ` ID,y,z,p-value 1,a,b,1.2e-10 2,c,d,3e-7 3,e,f,2e-8 ` – user10345633 Jan 29 '20 at 04:46
  • If you are on Linux, a quick test is to copy the text in your question, then in a terminal `cat << EOF > datafile` press return and paste the text into the terminal. Then on the 1st blank line type `EOF` and press return. That will create a proper ASCII version of your file to use. Now run the `awk` command against `datafile`. If you are using WSL in Win10 -- that works there too. You cannot save with a windows editor (unless you have changed the default line-endings and chosen UTF-8 as your character set - choosing ASCII is safer) – David C. Rankin Jan 29 '20 at 04:56
  • See: [how to convert dos windows newline crlf to unix newline](https://stackoverflow.com/questions/2613800/how-to-convert-dos-windows-newline-crlf-to-unix-newline-lf-in-a-bash-script) and [HTML Encoding (Character Sets)](https://www.w3schools.com/html/html_charset.asp) `em-dash` and `en-dash`. Also [Format-Hex](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/format-hex?view=powershell-7) if on windows. – David C. Rankin Jan 29 '20 at 05:09
  • Run a hexdump program and make sure your `'-'` are not html Unicode dashes (`em-dash` or `en-dash`) instead of ASCII `45 (0x2d)`. Here `awk` will handle the DOS line endings without issue. There is an issue with your data file if you are getting different results. What OS are you on? – David C. Rankin Jan 29 '20 at 05:13
0

A non-awk approach, using GNU datamash:

$ datamash -H -f -t, -g1 min 4 < input.txt | cut -d, -f1-4
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8

(The cut is needed because with the -f option datamash adds a fifth column that's a duplicate of the 4th; without it it'll just show the first and fourth column values. Minor annoyance.)

This does require that your data is sorted on the first column like in your sample.

Shawn
  • 47,241
  • 3
  • 26
  • 60