I have a huge file of genetic markers for 2890 individuals. I would like to transpose this file. The format of my data is as follows (only the first 6 markers are shown):

ID rs4477212 kgp15297216 rs3131972 kgp6703048 kgp15557302 kgp12112772 ..... 
BV04976 0 0 1 0 0 0 
BV76296 0 0 1 0 0 0 
BV02803 0 0 0 0 0 0 
BV09710 0 0 1 0 0 0 
BV17599 0 0 0 0 0 0 
BV29503 0 0 1 1 0 1 
BV52203 0 0 0 0 0 0 
BV61727 0 0 1 0 0 0 
BV05952 0 0 0 0 0 0 

In fact, I have 1,743,680 columns and 2890 rows in my text file. How can I transpose it? I would like the output to look like this:

ID BV04976 BV76296 BV02803 BV09710 BV17599 BV29503 BV52203 BV61727 BV05952  
rs4477212 0 0 0 0 0 0 0 0 0 
kgp15297216 0 0 0 0 0 0 0 0 0 
rs3131972 1 1 0 1 0 1 0 1 0 
kgp6703048 0 0 0 0 0 1 0 0 0 
kgp15557302 0 0 0 0 0 0 0 0 0 
kgp12112772 0 0 0 0 0 1 0 0 0
user2872354

2 Answers


I would make multiple passes over the file, perhaps 100, with each pass collecting 1743680/passes columns and writing them out (as rows) at the end of the pass.

Assemble the data into strings in an array, not an array of arrays, for lower memory usage and fewer passes. Preallocating the space for each string at the beginning of each pass (e.g. $new_row[13] = ' ' x 6000; $new_row[13] = ''; which grows the string's buffer once and then empties it) might or might not help.
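In Python (rather than the Perl the snippet above hints at), the multi-pass idea might be sketched as follows; the function name, the pass count, and the assumption of a space-separated file are all illustrative:

```python
def transpose_in_passes(infile, outfile, passes=100):
    """Transpose a whitespace-separated table using a bounded number
    of full passes over the input, keeping only one batch of output
    rows (as plain strings) in memory at a time."""
    # Determine the total column count from the first row.
    with open(infile) as fh:
        ncols = len(fh.readline().split())
    batch = (ncols + passes - 1) // passes  # columns handled per pass

    with open(outfile, "w") as out:
        for start in range(0, ncols, batch):
            # One growing string per output row, not an array of
            # arrays, to keep per-pass memory usage low.
            rows = [""] * min(batch, ncols - start)
            with open(infile) as fh:
                for line in fh:
                    fields = line.split()
                    for i, val in enumerate(fields[start:start + batch]):
                        rows[i] += (" " if rows[i] else "") + val
            for r in rows:
                out.write(r + "\n")
```

With passes=100 and 1,743,680 columns, each pass holds roughly 17,437 output rows in memory, each about 2 × 2890 characters long for 0/1 genotype data.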

ysth
  • note that this is just the outline of an answer; really giving a good answer would require more information (specifically, the things I ask in comments to the question) – ysth Nov 08 '13 at 04:44

(See: An efficient way to transpose a file in Bash )

Have you tried

awk -f tr.awk input.txt > out.txt

where tr.awk is

{
    # Store every field of every line: a[row, column] = value
    for (i=1; i<=NF; i++) a[NR,i]=$i
}
END {
    # At END, NF still holds the field count of the last line;
    # emit input column i as output row i
    for (i=1; i<=NF; i++) {
        for (j=1; j<=NR; j++) {
            printf "%s", a[j,i]
            if (j<NR) printf "%s", OFS
        }
        printf "%s", ORS
    }
}

Your file is probably too large to hold in memory for the above procedure, though. In that case you could try splitting it up first. For example:

#! /bin/bash
numrows=2890
echo "Splitting file.."
# Write each row of the input to its own file: x0000, x0001, ...
split -d -a4 -l1 input.txt
outfile="out.txt"
tempfile="temp.txt"
if [ -e $outfile ] ; then
    rm -i $outfile
fi
for (( i=0; i<$numrows; i++ )) ; do
    echo "Processing file: "$(expr $i + 1)"/"$numrows
    file=$(printf "x%04d" $i)
    tfile=${file}.tr
    # Turn the single row into a single column
    tr -s ' ' '\n' < $file > $tfile
    rm $file
    if [ $i -gt 0 ] ; then
        # Paste the new column onto the right of the accumulated result
        paste -d' ' $outfile $tfile > $tempfile
        rm $outfile
        mv $tempfile $outfile
        rm $tfile
    else
        mv $tfile $outfile
    fi
done

Note that split will generate 2890 temporary files (!)

Håkon Hægland