Sort multiple csv files within a directory based on two columns

Question

I have multiple .csv files in a directory called mydirectory. I want to sort all these files using some bash/awk/sed command first based on LeftChr column and then RightChr column and get the result.

  >Id  LeftChr  LeftPosition  LeftStrand  LeftLength  RightChr  RightPosition  RightStrand
1979     chr1        825881           -         252      chr2        5726723            -
5480     chr2        826313           +         444      chr2        5727501            +
5492     chr5        869527           +         698      chr2         870339            +
1980     chr2       1584550           -         263      chr1        1651034            -
5491    chr14       1685863           +         148      chr1        1686679            +
5490     chr1       1691382           +         190      chr1        1693020            +

result

  >Id  LeftChr  LeftPosition  LeftStrand  LeftLength  RightChr  RightPosition  RightStrand
5490     chr1       1691382           +         190      chr1        1693020            +
1979     chr1        825881           -         252      chr2        5726723            -
1980     chr2       1584550           -         263      chr1        1651034            -
5480     chr2        826313           +         444      chr2        5727501            +
5492     chr5        869527           +         698      chr2         870339            +
5491    chr14       1685863           +         148      chr1        1686679            +

Read your file into R. And then df[, order(df$LeftChr, df$RightChr)] — Ven Yao, Dec 04 '15 at 06:16

potong · Answer 1 · 2015-12-04T21:19:03.803

This might work for you (GNU sed and sort):

sed '1b;/Id/d;s/chr//g' mydirectory/*.csv |
sort -k2,2n -k6,6n |
sed '1b;s/\S\+/chr&/2;s/\S\+/chr&/6' > outputFile

This drops all but the first header and removes the literal chr from all files. The ensuing file is piped into a sort which sorts the file by the second and sixth fields numerically. This inturn is piped into a final sed command which ignores the first line (header line) and replaces the literal chr in the second and sixth fields.

bian · Answer 2 · 2015-12-04T09:33:04.973

1

awk 'h{NF+=0;print |"sort -t\" \" -k2.4n -k6.4n"}!h{print;h=1}' file | column -t
Id    LeftChr  LeftPosition  LeftStrand  LeftLength  RightChr  RightPosition  RightStrand
5490  chr1     1691382       +           190         chr1      1693020        +
1979  chr1     825881        -           252         chr2      5726723        -
1980  chr2     1584550       -           263         chr1      1651034        -
5480  chr2     826313        +           444         chr2      5727501        +
5492  chr5     869527        +           698         chr2      870339         +
5491  chr14    1685863       +           148         chr1      1686679        +

Yes ah, this pattern does not become a

edited Dec 04 '15 at 09:33

answered Dec 04 '15 at 07:49

bian

1,456
8
7

I liked this answer and so do not want to add my own... but as we need to do a directory of files... Wouldn't it be better to do: awk -v h=0 '!h { print; h=1 } NR>1 { print | "sort ... followed by your very sweet sort-fu? Why do we not want to to put column -t outside the awk script? – Michael Back Dec 04 '15 at 09:11
You are right. `column -t` out... – bian Dec 04 '15 at 09:33
Consider that we do not want to print the header in subsequent files; therefore, I'm pretty sure we want to use NR>1{ print | "sort ... } instead of h { print | "sort ... }. I also think NF+=0 doesn't do anything. So... Suggest: awk 'NF>1{print |"sort -t\" \" -k2.4n -k6.4n"}!h{print;h=1}' FILE1 FILE2 FILE3 | column -t ... What do you think? – Michael Back Dec 04 '15 at 10:29
`NF+=0` This sort needed. You can remove it test. Other you can modify... – bian Dec 04 '15 at 10:33

score 0 · Answer 3 · answered Dec 04 '15 at 07:15

Assuming you have access to a reasonable computing environment, the following should provide the foundation for what you are trying to do:

in=input.txt; head -n 1 "$in"; tail -n +2 "$in" | sort -k2,2 -k6,6

There are several potential issues, though. One is that the input file you have posted is not a "CSV" file in the usual sense. Another is whether you want a "stable sort" or not.

score 0 · Answer 4 · answered Dec 04 '15 at 07:19

0

load it into r

result <- yourdataname[order(,yourdataname[,LeftChr], yourdataname[,RightChr])]

if you have NAs in the dataset:

result <- yourdataname[order(yourdataname[,"LeftChr"],yourdataname[,"RightChr"], na.last = NA),]

answered Dec 04 '15 at 07:19

Gnusper

213
2
7

score 0 · Answer 5 · answered Dec 04 '15 at 08:47

0

None of the above answer worked for me, but was able to get this done with something like this.

for x in *.csv; do grep -v "^>" *.csv | sort -k2,2V -k6,6V -k3,3n -t $','<"$x" >"$x.tmp" mv "$x.tmp" "$x" done

answered Dec 04 '15 at 08:47

MAPK

5,635
4
37
88

Sort multiple csv files within a directory based on two columns

5 Answers5