0

I have multiple .csv files in a directory called mydirectory. I want to sort all these files using some bash/awk/sed command first based on LeftChr column and then RightChr column and get the result.

  >Id  LeftChr  LeftPosition  LeftStrand  LeftLength  RightChr  RightPosition  RightStrand
1979     chr1        825881           -         252      chr2        5726723            -
5480     chr2        826313           +         444      chr2        5727501            +
5492     chr5        869527           +         698      chr2         870339            +
1980     chr2       1584550           -         263      chr1        1651034            -
5491    chr14       1685863           +         148      chr1        1686679            +
5490     chr1       1691382           +         190      chr1        1693020            +

result

  >Id  LeftChr  LeftPosition  LeftStrand  LeftLength  RightChr  RightPosition  RightStrand
5490     chr1       1691382           +         190      chr1        1693020            +
1979     chr1        825881           -         252      chr2        5726723            -
1980     chr2       1584550           -         263      chr1        1651034            -
5480     chr2        826313           +         444      chr2        5727501            +
5492     chr5        869527           +         698      chr2         870339            +
5491    chr14       1685863           +         148      chr1        1686679            +
MAPK
  • 5,635
  • 4
  • 37
  • 88

5 Answers5

1

This might work for you (GNU sed and sort):

sed '1b;/Id/d;s/chr//g' mydirectory/*.csv |
sort -k2,2n -k6,6n |
sed '1b;s/\S\+/chr&/2;s/\S\+/chr&/6' > outputFile

This drops all but the first header and removes the literal chr from all files. The ensuing file is piped into a sort which sorts the file by the second and sixth fields numerically. This inturn is piped into a final sed command which ignores the first line (header line) and replaces the literal chr in the second and sixth fields.

potong
  • 55,640
  • 6
  • 51
  • 83
1
awk 'h{NF+=0;print |"sort -t\" \" -k2.4n -k6.4n"}!h{print;h=1}' file | column -t
Id    LeftChr  LeftPosition  LeftStrand  LeftLength  RightChr  RightPosition  RightStrand
5490  chr1     1691382       +           190         chr1      1693020        +
1979  chr1     825881        -           252         chr2      5726723        -
1980  chr2     1584550       -           263         chr1      1651034        -
5480  chr2     826313        +           444         chr2      5727501        +
5492  chr5     869527        +           698         chr2      870339         +
5491  chr14    1685863       +           148         chr1      1686679        +

Yes ah, this pattern does not become a

bian
  • 1,456
  • 8
  • 7
  • I liked this answer and so do not want to add my own... but as we need to do a directory of files... Wouldn't it be better to do: awk -v h=0 '!h { print; h=1 } NR>1 { print | "sort ... followed by your very sweet sort-fu? Why do we not want to to put column -t outside the awk script? – Michael Back Dec 04 '15 at 09:11
  • You are right. `column -t` out... – bian Dec 04 '15 at 09:33
  • Consider that we do not want to print the header in subsequent files; therefore, I'm pretty sure we want to use NR>1{ print | "sort ... } instead of h { print | "sort ... }. I also think NF+=0 doesn't do anything. So... Suggest: awk 'NF>1{print |"sort -t\" \" -k2.4n -k6.4n"}!h{print;h=1}' FILE1 FILE2 FILE3 | column -t ... What do you think? – Michael Back Dec 04 '15 at 10:29
  • `NF+=0` This sort needed. You can remove it test. Other you can modify... – bian Dec 04 '15 at 10:33
0

Assuming you have access to a reasonable computing environment, the following should provide the foundation for what you are trying to do:

in=input.txt; head -n 1 "$in"; tail -n +2 "$in" | sort -k2,2 -k6,6 

There are several potential issues, though. One is that the input file you have posted is not a "CSV" file in the usual sense. Another is whether you want a "stable sort" or not.

peak
  • 105,803
  • 17
  • 152
  • 177
0

load it into r

result <- yourdataname[order(,yourdataname[,LeftChr], yourdataname[,RightChr])] 

if you have NAs in the dataset:

result <- yourdataname[order(yourdataname[,"LeftChr"],yourdataname[,"RightChr"], na.last = NA),]
Gnusper
  • 213
  • 2
  • 7
0

None of the above answer worked for me, but was able to get this done with something like this.

for x in *.csv; do grep -v "^>" *.csv | sort -k2,2V -k6,6V -k3,3n -t $','<"$x" >"$x.tmp" mv "$x.tmp" "$x" done

MAPK
  • 5,635
  • 4
  • 37
  • 88