I am trying to select names in a file based on their value in column 2. I am using awk for this instead of going into R, for speed, but I am not getting the results I expect.

EDIT:
gshuf -n 20 file.csv
targets,log2FoldChange
TRINITY_GG_37011_c0_g1_i1.mrna1,2.837606866
TRINITY_GG_8817_c1_g1_i1.mrna1,-1.895959897
TRINITY_GG_73755_c2_g1_i1.mrna1,2.23502917
TRINITY_GG_63035_c0_g1_i1.mrna1,2.185122911
TRINITY_GG_111654_c0_g1_i1.mrna1,8.101066537
TRINITY_GG_59126_c0_g1_i4.mrna1,3.482842141
TRINITY_GG_37271_c0_g1_i6.mrna1,-3.046035487
TRINITY_GG_53334_c0_g1_i3.mrna1,-3.96110701
TRINITY_GG_26406_c0_g1_i2.mrna1,9.391942576
TRINITY_GG_113831_c0_g1_i1.mrna1,3.22109874
TRINITY_GG_114771_c0_g1_i7.mrna1,7.109622418
TRINITY_GG_125067_c0_g1_i9.mrna1,23.02443794
TRINITY_GG_32340_c1_g1_i9.mrna1,5.983333292
TRINITY_GG_101900_c0_g1_i1.mrna1,-3.48623125
TRINITY_GG_3412_c0_g1_i2.mrna1,2.515568648
TRINITY_GG_122872_c0_g1_i7.mrna1,9.993553116
TRINITY_GG_18380_c0_g1_i1.mrna1,-4.455484395
TRINITY_GG_69309_c0_g2_i11.mrna1,-6.927214772
TRINITY_GG_68534_c7_g1_i1.mrna1,-3.149415191
TRINITY_GG_95195_c0_g1_i11.mrna1,7.607035309

cat file.csv | wc -l
   10687

#To get >=2.5 
cat file.csv | awk -F, '{if($2>=2.5)print $1}'| wc -l
    3308


#Between -2.5 and 2.5
cat file.csv | awk -F, '{if($2>-2.5 &&  $2 < 2.5)print $1}'| wc -l
    5451

#To get <=-2.5
cat file.csv| awk -F, '{if($2<=-2.5)print $1}'| wc -l
    1929

But I have manually inspected the output and the values are all over the place.

#This should only print when column 2 <= -2.5
cat file.csv | awk -F, '{if($2<=-2.5)print $1,$2}'| head
TRINITY_GG_63049_c0_g1_i1.mrna1 -0.397269608
TRINITY_GG_148283_c0_g1_i1.mrna1 -0.410665303
TRINITY_GG_107346_c0_g1_i3.mrna1 -0.444588319
TRINITY_GG_25844_c1_g1_i1.mrna1 -0.455797238
TRINITY_GG_95_c1_g1_i1.mrna1 -0.467825233
TRINITY_GG_138461_c2_g1_i1.mrna1 -0.471162154
TRINITY_GG_111467_c0_g1_i4.mrna1 -0.473621231

Can anyone suggest what's the problem?

  • With your shown samples, none of the records (lines) satisfy this condition. Could you please add samples to your question that do satisfy it, so that we can try to reproduce this on our side? – RavinderSingh13 Dec 08 '20 at 20:18
  • Can you post an extract from the initial csv file? – Raman Sailopal Dec 08 '20 at 20:18
  • You don't need to cat file through to awk for a start, just use awk ....... filename – Raman Sailopal Dec 08 '20 at 20:19
  • If I just run awk on the file directly, the same thing still happens – Amaranta_Remedios Dec 08 '20 at 20:20
  • awk -F, '{if($2<=-2.5)print $1,$2}' should be awk -F, '$2<=-2.5 { print $1,$2 }' – Raman Sailopal Dec 08 '20 at 20:23
  • I just tried it but the problem persists: I get the same numbers and some values are -0.39, for instance. – Amaranta_Remedios Dec 08 '20 at 20:25
  • @Amaranta_Remedios, could you please check once if you have control M characters by doing `cat -v Input_file`? Let us know how it goes, looks to me that could be the problem here. – RavinderSingh13 Dec 08 '20 at 20:26
  • So it prints the line like this: ```TRINITY_GG_22477_c0_g1_i61.mrna1,-26.3457648^M TRINITY_GG_81688_c0_g1_i9.mrna1,-26.90588304^M``` – Amaranta_Remedios Dec 08 '20 at 20:28
  • Seems that the file has DOS line endings, i.e. `\r\n` vs. just `\n`, and that seems to cause problems for me with mawk, awk 20121220 and busybox awk, but it works with GNU awk. dos2unix removes the `\r` for you, or you could try `awk -v RS="\r\n"` (worked with all the aforementioned awks). – James Brown Dec 08 '20 at 20:41
  • Thanks, that did it. I ran dos2unix on the file and now all the commands work. If you write your comment as an answer I can accept it. – Amaranta_Remedios Dec 08 '20 at 20:49
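
To spell out what the comments arrived at: with `\r\n` line endings, the last field of every line carries a trailing carriage return, so a value like `-0.397269608\r` is likely no longer treated as a number by some awk implementations and gets compared to `-2.5` as a string. The header row is also compared as a string. Below is a minimal sketch of one way to handle this inside awk itself, by stripping the `\r` before the comparison and skipping the header. The file name `crlf_sample.csv` and its three data rows are made up for illustration and are not from the original file.

```shell
# Build a small CRLF-terminated sample (hypothetical file and values).
printf 'targets,log2FoldChange\r\nA,-0.39\r\nB,-3.04\r\nC,2.83\r\n' > crlf_sample.csv

# Remove a trailing carriage return from each record; changing $0 makes
# awk re-split the fields, so $2 becomes a clean numeric string.
# NR > 1 skips the header line.
down=$(awk -F, '{ sub(/\r$/, "") } NR > 1 && $2 <= -2.5 { print $1 }' crlf_sample.csv)

echo "$down"   # prints: B
```

Converting the file once with `dos2unix`, or setting `RS="\r\n"` as suggested in the comments, makes the `sub()` call unnecessary; the in-awk version is only convenient when the input file should stay untouched.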
