I have a huge file of genetic markers for 2890 individuals. I would like to transpose this file. The format of my data is as follows (only the first 6 markers are shown):

ID rs4477212 kgp15297216 rs3131972 kgp6703048 kgp15557302 kgp12112772 ..... 
BV04976 0 0 1 0 0 0 
BV76296 0 0 1 0 0 0 
BV02803 0 0 0 0 0 0 
BV09710 0 0 1 0 0 0 
BV17599 0 0 0 0 0 0 
BV29503 0 0 1 1 0 1 
BV52203 0 0 0 0 0 0 
BV61727 0 0 1 0 0 0 
BV05952 0 0 0 0 0 0 

In fact, I have 1,743,680 columns and 2890 rows in my text file. How can I transpose it? I would like the output to look like this:

ID BV04976 BV76296 BV02803 BV09710 BV17599 BV29503 BV52203 BV61727 BV05952  
rs4477212 0 0 0 0 0 0 0 0 0 
kgp15297216 0 0 0 0 0 0 0 0 0 
rs3131972 1 1 0 1 0 1 0 1 0 
kgp6703048 0 0 0 0 0 1 0 0 0 
kgp15557302 0 0 0 0 0 0 0 0 0 
kgp12112772 0 0 0 0 0 1 0 0 0
user2872354

2 Answers


I would make multiple passes over the file, perhaps 100, with each pass collecting 1743680/passes columns and writing them out (as rows) at the end of the pass.

Assemble the data into strings in an array, not an array of arrays, for lower memory usage and fewer passes. Preallocating the space for each string at the beginning of each pass (e.g. $new_row[13] = ' ' x 6000; $new_row[13] = ''; which grows the string's buffer once and then empties it) might or might not help.
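In Python (rather than the Perl the snippet above hints at), the multi-pass idea might be sketched as follows; the function name, the pass count, and the assumption of a space-separated file are all illustrative:

```python
def transpose_in_passes(infile, outfile, passes=100):
    """Transpose a whitespace-separated table using a bounded number
    of full passes over the input, keeping only one batch of output
    rows (as plain strings) in memory at a time."""
    # Determine the total column count from the first row.
    with open(infile) as fh:
        ncols = len(fh.readline().split())
    batch = (ncols + passes - 1) // passes  # columns handled per pass

    with open(outfile, "w") as out:
        for start in range(0, ncols, batch):
            # One growing string per output row, not an array of
            # arrays, to keep per-pass memory usage low.
            rows = [""] * min(batch, ncols - start)
            with open(infile) as fh:
                for line in fh:
                    fields = line.split()
                    for i, val in enumerate(fields[start:start + batch]):
                        rows[i] += (" " if rows[i] else "") + val
            for r in rows:
                out.write(r + "\n")
```

With passes=100 and 1,743,680 columns, each pass holds roughly 17,437 output rows in memory, each about 2 × 2890 characters long for 0/1 genotype data.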

ysth
  • note that this is just the outline of an answer; really giving a good answer would require more information (specifically, the things I ask in comments to the question) – ysth Nov 08 '13 at 04:44

(See: An efficient way to transpose a file in Bash )

Have you tried

awk -f tr.awk input.txt > out.txt

where tr.awk is

{
    # Store every field of every line: a[row, column] = value
    for (i=1; i<=NF; i++) a[NR,i]=$i
}
END {
    # At END, NF still holds the field count of the last line;
    # emit input column i as output row i
    for (i=1; i<=NF; i++) {
        for (j=1; j<=NR; j++) {
            printf "%s", a[j,i]
            if (j<NR) printf "%s", OFS
        }
        printf "%s", ORS
    }
}

Your file is probably too large to hold in memory for the above procedure, though. In that case you could try splitting it up first. For example:

#! /bin/bash
numrows=2890
echo "Splitting file.."
# Write each row of the input to its own file: x0000, x0001, ...
split -d -a4 -l1 input.txt
outfile="out.txt"
tempfile="temp.txt"
if [ -e $outfile ] ; then
    rm -i $outfile
fi
for (( i=0; i<$numrows; i++ )) ; do
    echo "Processing file: "$(expr $i + 1)"/"$numrows
    file=$(printf "x%04d" $i)
    tfile=${file}.tr
    # Turn the single row into a single column
    tr -s ' ' '\n' < $file > $tfile
    rm $file
    if [ $i -gt 0 ] ; then
        # Paste the new column onto the right of the accumulated result
        paste -d' ' $outfile $tfile > $tempfile
        rm $outfile
        mv $tempfile $outfile
        rm $tfile
    else
        mv $tfile $outfile
    fi
done

Note that split will generate 2890 temporary files (!)

Håkon Hægland