138

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it efficiently using only bash commands (I could write a ten-or-so-line Perl script to do that, but it should be slower to execute than native bash functions). So the output should look like

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I thought of a solution like this

cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

But it's slow and doesn't seem the most efficient solution. I've seen a solution for vi in this post, but it's still too slow. Any thoughts/suggestions/brilliant ideas? :-)

Federico Giorgi
  • 13
What makes you think that there would exist a bash script that's going to be faster than a Perl script? This is exactly the kind of problem that Perl excels in. – Mark Pim Nov 13 '09 at 15:16
  • 2
@mark, if it's pure bash, it might be faster than chaining all those cut/sed etc. tools together. But then again, if you define "bash" as in combining tools, then just writing an awk script will be comparable to Perl wrt text processing. – ghostdog74 Nov 13 '09 at 15:41
  • Add another for not understanding how perl would be slow here. Slow to write the code? Slow to execute? I genuinely dislike perl, but it does excel at this sort of task. – Corey Porter Nov 13 '09 at 17:25
  • If your columns/fields have a fixed size/width, then you can use Python file seek to avoid reading your file into memory. Do you have fixed column/field sizes/widths? – tommy.carstensen Apr 07 '13 at 23:09
  • 3
    Anyone who thinks a shell script would be faster than awk or perl needs to read http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice so they can understand why that is not the case. – Ed Morton Apr 10 '16 at 15:12

33 Answers

138
awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file

output

$ more file
0 1 2
3 4 5
6 7 8
9 10 11

$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11

Performance against the Perl solution by Jonathan on a 10,000-line file

$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2

$  wc -l < file
10000

$ time perl test.pl file >/dev/null

real    0m0.480s
user    0m0.442s
sys     0m0.026s

$ time awk -f test.awk file >/dev/null

real    0m0.382s
user    0m0.367s
sys     0m0.011s

$ time perl test.pl file >/dev/null

real    0m0.481s
user    0m0.431s
sys     0m0.022s

$ time awk -f test.awk file >/dev/null

real    0m0.390s
user    0m0.370s
sys     0m0.010s

EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).

Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator, which the OP originally asked for, so it handles empty fields, and it coincidentally prettifies the output a bit for this particular case.

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

The above solutions will work in any awk (except old, broken awk of course - there YMMV).

The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
    print ""
    if (ARGIND < NF) {
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

which uses almost no memory but reads the input file once per field on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END (see the sketch below).
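For example, here's one way the same multi-pass idea might look without ENDFILE/ARGIND (a sketch only, with a hypothetical file name posix_tst.awk; it keeps the assumption that every line has the same number of fields and still re-reads the file once per column):

$ cat posix_tst.awk
BEGIN { FS=OFS="\t" }
FNR==1 {
    if (pass) print ""            # finish the previous output row
    pass++
    if (pass < NF) {              # queue another pass while columns remain
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
{ printf "%s%s", (FNR>1 ? OFS : ""), $pass }
END { print "" }

$ awk -f posix_tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11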

Ed Morton
ghostdog74
  • And now to handle row and column labels too? – Jonathan Leffler Nov 13 '09 at 15:54
  • OK - you're correct; your sample data doesn't match the question's sample data, but your code works fine on the question's sample data and gives the required output (give or take blank vs tab spacing). Mainly my mistake. – Jonathan Leffler Nov 13 '09 at 17:20
  • Interesting timings - I agree you see a performance benefit in awk. I was using MacOS X 10.5.8, which does not use 'gawk'; and I was using Perl 5.10.1 (32-bit build). I gather that your data was 10000 lines with 4 columns per line? Anyway, it doesn't matter a great deal; both awk and perl are viable solutions (and the awk solution is neater - the 'defined' checks in my Perl are necessary for warning free runs under strict/warnings) and neither is a slouch and both are likely to be way faster than the original shell script solution. – Jonathan Leffler Nov 16 '09 at 09:43
  • On my original 2.2GB matrix, the perl solution is slightly faster than awk - 350.103s vs. 369.410s I was using perl 5.8.8 64bit – Federico Giorgi Nov 16 '09 at 10:18
  • i am using gawk 3.16a, Perl 5.10.0. – ghostdog74 Nov 16 '09 at 10:27
  • What are the memory requirements of each of the two methods on your 2.2GB matrix/file? – tommy.carstensen Apr 07 '13 at 23:40
  • Which solution(s) did you end up using? – tommy.carstensen Apr 08 '13 at 08:59
  • Result can be piped through `| column -t` to make the result easier to read. – Vytenis Bivainis May 25 '14 at 10:38
  • 1
    Note: `awk has maximum number of fields size=32767.` – zx8754 Jun 05 '14 at 14:56
  • Hi what's the meaning of this line "NF>p { p = NF }"? I didn't get it...thanks a lot – keypoint Oct 03 '15 at 00:38
  • Beautiful, but you might want to point out to non-awkers like me that what goes in test.awk is the bit between the quotes! – daknowles Apr 03 '16 at 00:24
  • I really like the awk solution, but is there an easy way to pipe the output of find, or somehow pass many files at once and output the transposed results of them all together? – user5359531 Apr 05 '16 at 01:36
  • @tommy.carstensen the memory requirements are the same for both solutions since they both read the whole file into memory before printing in the new order – Ed Morton Apr 10 '16 at 14:49
  • 1
    @zx8754 that max number of fields only applies to an old, non-POSIX awk. Possibly the incredibly unfortunately named "nawk". It does not apply to gawk or other modern awks. – Ed Morton Apr 10 '16 at 14:50
  • @keypoint `NF>p { p = NF }` is identifying the max number of fields across all lines in the file just in case not all lines have the same number of fields so the tool can later print the max number of rows. – Ed Morton Apr 10 '16 at 14:51
  • @user5359531 just list all the files you're interested in on the command line: `awk '...' file1 file2 ... fileN`. – Ed Morton Apr 10 '16 at 14:52
  • I used your first `awk` code but after transposing, the leading zeros from 1st column data vanished. Any idea? – Sigur Apr 08 '17 at 02:46
  • Your `awk` solution worked much faster than GNU `datamash` on a very large file that would not fit into memory. – Alex Reynolds Mar 18 '19 at 22:46
  • https://www.geeksforgeeks.org/awk-command-unixlinux-examples/ one of the best article to learn about awk command by @geeksforgeeks – jerinisready May 13 '21 at 07:05
  • Thanks, @EdMorton! I will surely look into it. :-) – jerinisready Aug 17 '21 at 09:13
  • @jerinisready I recommend everyone buy the book because that is the only source of income that the person who provides gawk and all documentation gets in exchange for all of the work they put into helping our community. – Ed Morton Aug 17 '21 at 14:05
76

csvtk

-t uses tab as the delimiter and -d specifies a custom delimiter:

$ seq 4|paste - -|csvtk -t transpose
1   3
2   4
$ seq 4|paste -d, - -|csvtk -d, transpose
1,3
2,4

-l (--lazy-quotes) allows double quotes within a field which is not surrounded by double quotes, but I don't think there's any way to disable CSV-style escaping for double quotes in the output:

$ csvtk -t transpose<<<$'aa"bb\t2\n3\t4'
[ERRO] parse error on line 1, column 3: bare " in non-quoted-field
$ csvtk -lt transpose<<<$'aa"bb\t2\n3\t4'
"aa""bb"    3
2   4

Another caveat is that -l doesn't prevent removing double quotes around fields that don't need to be quoted in CSV:

$ csvtk -lt transpose<<<$'"1"\t2\n3\t4'
1   3
2   4

awk

Gawk version which uses arrays of arrays:

tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' "${1+FS=$1}";}

Plain awk version which uses multidimensional arrays (this took about twice as long to run in my benchmark):

tp(){ awk '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}' "${1+FS=$1}";}

macOS comes with a version of Brian Kernighan's nawk from 2007 which doesn't support arrays of arrays.

To use space as a field separator without collapsing sequences of multiple spaces, use FS='[ ]'.
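For example, a sketch of calling the plain-awk tp function above (it reads standard input; the optional argument sets the field separator):

$ seq 4|paste -d, - -|tp ,
1,3
2,4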

rs

rs is a BSD utility which also comes with macOS, but it should be available from package managers on other platforms. It is named after the reshape function in APL.

Use sequences of spaces and tabs as column separator:

rs -T

Use tab as column separator:

rs -c -C -T

Use comma as column separator:

rs -c, -C, -T

-c changes the input column separator and -C changes the output column separator. A lone -c or -C sets the separator to tab. -T transposes rows and columns.

Do not use -t instead of -T, because it selects the number of output columns so that the output lines fill the width of the display (which is 80 characters by default but which can be changed with -w).

When an output column separator is specified using -C, an extra column separator character is added to the end of each row, but you can remove it with sed:

$ seq 4|paste -d, - -|rs -c, -C, -T
1,3,
2,4,
$ seq 4|paste -d, - -|rs -c, -C, -T|sed s/.\$//
1,3
2,4

rs -T determines the number of columns based on the number of columns on the first row, so it produces the wrong result when the first line ends with one or more empty columns:

$ rs -c, -C, -T<<<$'1,\n3,4'
1,3,4,

R

The t function transposes a matrix or a dataframe:

Rscript -e 'write.table(t(read.table("stdin",sep=",",quote="",comment.char="")),sep=",",quote=F,col.names=F,row.names=F)'

If you replace Rscript -e with R -e, then it echoes the code that is being run to STDOUT, and it also results in the error "ignoring SIGPIPE signal" if the R command is followed by a command like head -n1 which exits before it has read the whole STDIN.

quote="" can be removed if the input doesn't contain double quotes or single quotes, and comment.char="" can be removed if the input doesn't contain lines that start with a hash character.

For large input, data.table::fread is usually faster than read.table and data.table::fwrite is faster than write.table:

$ seq 1e6|awk 'ORS=NR%1e3?FS:RS'>a
$ time Rscript --no-init-file -e 'write.table(t(read.table("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real  0m1.061s
user  0m0.983s
sys   0m0.074s
$ time Rscript --no-init-file -e 'write.table(t(data.table::fread("a")),quote=F,col.names=F,row.names=F)'>/dev/null

real  0m0.599s
user  0m0.535s
sys   0m0.048s
$ time Rscript --no-init-file -e 'data.table::fwrite(t(data.table::fread("a")),sep=" ",col.names=F)'>/dev/null
x being coerced from class: matrix to data.table

real  0m0.375s
user  0m0.296s
sys   0m0.073s

jq

tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}

jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.
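For example, a sketch of using the jq function above on comma-separated input:

$ seq 4|paste -d, - -|tp ,
1,3
2,4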

Ruby

ruby -e'STDIN.map{|x|x.chomp.split(",",-1)}.transpose.each{|x|puts x*","}'

The -1 argument to split disables discarding empty fields at the end:

$ ruby -e'p"a,,".split(",")'
["a"]
$ ruby -e'p"a,,".split(",",-1)'
["a", "", ""]

Function form:

$ tp(){ ruby -e's=ARGV[0];STDIN.map{|x|x.chomp.split(s==" "?/ /:s,-1)}.transpose.each{|x|puts x*s}' -- "${1-$'\t'}";}
$ seq 4|paste -d, - -|tp ,
1,3
2,4

The function above uses s==" "?/ /:s because when the argument to the split function is a single space, it enables awk-like special behavior where strings are split based on contiguous runs of spaces and tabs:

$ ruby -e'p" a  \tb ".split(" ",-1)'
["a", "b", ""]
$ ruby -e'p" a  \tb ".split(/ /,-1)'
["", "a", "", "\tb", ""]

C++

Compile with c++ -O3 -o tp tp.cpp:

#include<iostream>
#include<vector>
#include<sstream>

using namespace std;

int main(int argc,char**argv){
  vector<vector<string> >table;
  vector<string>row;
  string line,field;
  char sep=argc==1?'\t':argv[1][0];

  while(getline(cin,line)){
    row.clear();
    stringstream str(line);
    while(getline(str,field,sep))row.push_back(field);
    table.push_back(row);
  }

  int rows=table.size(),cols=table[0].size();
  for(int i=0;i<cols;i++){
    cout<<table[0][i];
    for(int j=1;j<rows;j++)cout<<sep<<table[j][i];
    cout<<endl;
  }
}

Benchmark

csvtk was the fastest and jq was the slowest:

$ seq 1e6|awk 'ORS=NR%1e3?"\t":"\n"'>test
$ TIMEFORMAT=%R
$ time ./tp<test>/dev/null # first run of C++ program is slower
0.895
$ time ./tp<test>/dev/null
0.520
$ time csvtk -t transpose<test>/dev/null
0.142
$ time rs -c -C -T<test|sed $'s/\t$//'>/dev/null
0.587
$ time gawk -F\\t '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' test>/dev/null
1.119
$ time awk -F\\t '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}'<test>/dev/null
1.858
$ time jq -R . test|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]'>/dev/null
3.604
$ time ruby -e'STDIN.map{|x|x.chomp.split("\t",-1)}.transpose.each{|x|puts x*"\t"}'<test>/dev/null
0.492
nisetama
  • 4
    I wasn't familiar with [`rs`](https://packages.debian.org/jessie/rs) -- thanks for the pointer! (The link is to Debian; the upstream appears to be https://www.mirbsd.org/MirOS/dist/mir/rs/) – tripleee Nov 26 '15 at 13:00
  • It looks like it cannot use tabs as separator. I tried '\t' and '^t'. This is not documented. – lalebarde Feb 27 '16 at 09:05
  • 2
    @lalebarde At least in the implementation of `rs` that comes with OS X, `-c` alone sets the input column separator to a tab. – nisetama Mar 05 '16 at 12:20
  • 2
    @lalebarde, try bash's [ANSI-C quoting](https://www.gnu.org/software/bash/manual/bashref.html#ANSI_002dC-Quoting) to get a tab character: `$'\t'` – glenn jackman Apr 10 '16 at 11:51
  • This worked for me when transposing a tab-delimited file: `rs -c$'\t' -C$'\t' -T` – Nathan S. Watson-Haigh Nov 21 '16 at 22:13
  • 3
    This is an extreme case, but for a very large file with many rows like `TTC TTA TTC TTC TTT`, running `rs -c' ' -C' ' -T < rows.seq > cols.seq` gives `rs: no memory: Cannot allocate memory`. This is a system running FreeBSD 11.0-RELEASE with 32 GB of ram. So, my guess is that `rs` puts everything in RAM, which is good for speed, but not for large data. – jrm Jul 06 '17 at 09:35
  • 1
    jq used 21Gb of ram on a 766MB file. I killed it after 40 minutes without any output. – Glubbdrubb Mar 20 '18 at 09:50
  • @jrm Your guess is correct, it says so in the manual: `BUGS The algorithm currently reads the whole file into memory, so files that do not fit in memory will not be reshaped`. – simlev Aug 29 '18 at 08:28
  • @nisetama What is the name of the RedHat package needed to install `rs` ? – SebMa Sep 07 '22 at 14:34
  • If I do `tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' "${1+FS=$1}";} input` I get `syntax error near unexpected token`. What am I doing wrong? – mugdi Dec 16 '22 at 16:45
  • @mugdi I found that awk version 3.1.7 on a RHEL6 host gave syntax errors on the `a[i][NR]` syntax, while awk version 4.2.1 on a RHEL8 host did not. I suspect that somewhere awk grew some "advanced" syntax with multi-dimensional arrays. – bgoodr Jul 10 '23 at 21:49
33

A Python solution:

python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output

The above is based on the following:

import sys

for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
    print(' '.join(c))

This code does assume that every line has the same number of columns (no padding is performed).
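For example, on the question's sample input (assuming the file is named input):

$ python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11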

Stephan202
  • 3
    One minor problem here: Replace `l.split()` by `l.strip().split()` (Python 2.7), else the last line of the output is crippled. Works for arbitrary column separators, use `l.strip().split(sep)` and `sep.join(c)` if your separator is stored in variable `sep`. – krlmlr Oct 02 '12 at 04:18
27

Have a look at GNU datamash, which can be used like datamash transpose. A future version will also support cross tabulation (pivot tables).

Here is how you would do it with space separated columns:

datamash transpose -t ' ' < file > transposed_file
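For example, on the question's sample data (a sketch assuming GNU datamash is installed and the sample file is single-space separated as shown):

$ datamash transpose -t ' ' < file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11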
winni2k
pixelbeat
  • 1
This answer is very helpful for large files. I faced a 28GB file and `datamash` handled the transposition very quickly! – mugdi Jan 05 '23 at 16:12
23

The transpose project on SourceForge is a coreutils-like C program for exactly that.

gcc transpose.c -o transpose
./transpose -t input > output #works with stdin, too.
flying sheep
  • Thanks for the link. However, it requires too much memory, when dealing with large matrices/files. – tommy.carstensen Apr 08 '13 at 09:41
  • it has arguments for blocksize and fieldsize: try tweaking the `-b` and `-f` arguments. – flying sheep Apr 08 '13 at 14:54
  • Default block size (--block or -b) is 10kb and default field size (--fieldmax or -f) is 64, so that can't be it. I tried. Thanks for the suggestion though. – tommy.carstensen Apr 10 '13 at 16:27
  • 1
    Worked well with a csv of size 2 GB. – discipulus Nov 08 '16 at 03:10
  • 2
    For a matrix file with dimensions roughly 11k by 5k, I found transpose.c to be ~7x faster and ~5x more memory-efficient than the ghostdog74's first awk solution. Also, I found that the "uses almost no memory" awk code from ghostdog74 didn't work properly. Also, watch out for the --limit flag in the transpose.c program, which by default limits the output to dimension 1k by 1k. – ncemami Nov 28 '16 at 06:40
  • In case you don't like sourceforge or it's down, I made a [github mirror](https://github.com/jan-glx/transpose) of the project. – jan-glx Mar 04 '18 at 20:14
17

Pure BASH, no additional process. A nice exercise:

declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line ; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s\t" ${array[$COUNTER]}
  done
  printf "\n" 
done
Fritz G. Mehner
16

GNU datamash is perfectly suited for this problem, with only one line of code and potentially arbitrarily large files!

datamash -W transpose infile > outfile
Manavalan Gajapathy
Pal
14

There is a purpose-built utility for this:

GNU datamash utility

apt install datamash  

datamash transpose < yourfile

Taken from this site, https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods

nelaaro
9

Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.

#!/bin/perl -w
#
# SO 1729824

use strict;

my(%data);          # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
    my(@row) = split /\s+/;
    my($colnum) = 0;
    foreach my $val (@row)
    {
        $data{$rownum}{$colnum++} = $val;
    }
    $rownum++;
    $maxcol = $colnum if $colnum > $maxcol;
}

my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
    for (my $row = 0; $row < $maxrow; $row++)
    {
        printf "%s%s", ($row == 0) ? "" : "\t",
                defined $data{$row}{$col} ? $data{$row}{$col} : "";
    }
    print "\n";
}

With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.


Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:

Osiris JL: time gawk -f tr.awk xxx  > /dev/null

real    0m0.367s
user    0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null

real    0m0.138s
user    0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx  > /dev/null

real    0m1.891s
user    0m0.924s
sys 0m0.961s
Osiris-2 JL: 

Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.

Jonathan Leffler
7

Assuming all your rows have the same number of fields, this awk program solves the problem:

{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}

In words, as you loop over the rows, for every field f grow a ':'-separated string col[f] containing the elements of that field. After you are done with all the rows, print each one of those strings on a separate line. You can then replace ':' with the separator you want (say, a space) by piping the output through tr ':' ' '.

Example:

$ echo "1 2 3\n4 5 6"
1 2 3
4 5 6

$ echo "1 2 3\n4 5 6" | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' '
 1 4
 2 5
 3 6
6

If you have sc installed, you can do:

psc -r < inputfile | sc -W% - > outputfile
Dennis Williamson
  • 4
    Note that this supports a limited number of lines because `sc` names its columns as one or a combination of two characters. The limit is `26 + 26^2 = 702`. – Thor Nov 08 '12 at 10:38
5

I normally use this little awk snippet for this requirement:

  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
        max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++) 
                  printf "%s%s", a[i,j], (j==NR?RS:FS)
              }
        }' file

This just loads all the data into a bidimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input.

This needs to keep track of the maximum number of columns in the initial file, so that it can be used as the number of rows to print back.
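For example, saving the part between the quotes as transpose.awk (a hypothetical name) and running it on the question's sample file gives:

$ awk -f transpose.awk file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11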

fedorqui
3

A hackish Perl solution can look like this. It's nice because it doesn't load the whole file into memory: it writes intermediate temp files and then uses the all-wonderful paste.

#!/usr/bin/perl
use warnings;
use strict;

my $counter;
open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
while (my $line = <INPUT>) {
    chomp $line;
    my @array = split ("\t",$line);
    open OUTPUT, ">temp$." or die ("unable to open output file!");
    print OUTPUT join ("\n",@array);
    close OUTPUT;
    $counter=$.;
}
close INPUT;

# paste files together
my $execute = "paste ";
foreach (1..$counter) {
    $execute.="temp$counter ";
}
$execute.="> $ARGV[1]";
system $execute;
Federico Giorgi
  • using paste and temp files are just extra unnecessary operations. you can just do manipulation inside memory itself, eg arrays/hashes – ghostdog74 Nov 13 '09 at 17:11
  • 3
    Yep, but wouldn't that mean keeping everything in memory? The files I'm dealing with are around 2-20gb in size. – Federico Giorgi Nov 16 '09 at 11:49
3

The only improvement I can see to your own example is to use awk, which will reduce the number of processes that are run and the amount of data that is piped between them:

/bin/rm output 2> /dev/null

cols=`head -n 1 input | wc -w` 
for (( i=1; i <= $cols; i++))
do
  awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done >> output
Simon C
3

I'm a little late to the game but how about this:

cat table.tsv | python -c "import pandas as pd, sys; pd.read_csv(sys.stdin, sep='\t').T.to_csv(sys.stdout, sep='\t')"

or zcat if it's gzipped.

This assumes you have pandas installed for your version of Python.
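For example, with a gzipped input (a sketch; the file names are hypothetical):

zcat table.tsv.gz | python -c "import pandas as pd, sys; pd.read_csv(sys.stdin, sep='\t').T.to_csv(sys.stdout, sep='\t')" > table_transposed.tsv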

O.rka
2

I was looking for a similar bash transpose, but with support for padding. Here is the script I wrote based on fgm's solution; it seems to work. I hope it can be of help...

#!/bin/bash 
declare -a array=( )                      # we build a 1-D-array
declare -a ncols=( )                      # we build a 1-D-array containing number of elements of each row

SEPARATOR="\t";
PADDING="";
MAXROWS=0;
index=0
indexCol=0
while read -a line; do
    ncols[$indexCol]=${#line[@]}
    ((indexCol++))
    if [ ${#line[@]} -gt ${MAXROWS} ]; then
        MAXROWS=${#line[@]}
    fi
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
    COUNTER=$ROW
    for (( indexCol=0; indexCol < ${#ncols[@]}; indexCol++ )); do
        if [ $ROW -ge ${ncols[indexCol]} ]; then
            printf "%s" "$PADDING"                 # pad missing cells
        else
            printf "%s" ${array[$COUNTER]}
        fi
        if [ $((indexCol+1)) -lt ${#ncols[@]} ]; then
            printf "$SEPARATOR"
        fi
        COUNTER=$(( COUNTER + ncols[indexCol] ))
    done
    printf "\n"
done
2

Not very elegant, but this "single-line" command solves the problem quickly:

cols=4; for((i=1;i<=$cols;i++)); do \
            awk '{print $'$i'}' input | tr '\n' ' '; echo; \
        done

Here cols is the number of columns; you can replace 4 with $(head -n 1 input | wc -w).
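A sketch of the two combined, passing the column index to awk with -v instead of splicing it into the quoted program:

cols=$(head -n 1 input | wc -w)
for ((i=1; i<=cols; i++)); do
    awk -v col="$i" '{ printf "%s ", $col } END { print "" }' input
done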

agc
Felipe
2

I used fgm's solution (thanks fgm!), but needed to eliminate the tab characters at the end of each row, so I modified the script thus:

#!/bin/bash 
declare -a array=( )                      # we build a 1-D-array

read -a line < "$1"                       # read the headline

COLS=${#line[@]}                          # save number of columns

index=0
while read -a line; do
    for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
        array[$index]=${line[$COUNTER]}
        ((index++))
    done
done < "$1"

for (( ROW = 0; ROW < COLS; ROW++ )); do
  for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
    printf "%s" ${array[$COUNTER]}
    if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
    then
        printf "\t"
    fi
  done
  printf "\n" 
done
dtw
2

I was looking for a solution to transpose any kind of matrix (nxn or mxn) with any kind of data (numbers or text) and came up with the following:

Row2Trans=number1   # number of rows in the input
Col2Trans=number2   # number of columns in the input

for ((i=1; i<=Row2Trans; i++)); do
    for ((j=1; j<=Col2Trans; j++)); do
        awk -v var1="$i" -v var2="$j" 'BEGIN { FS = "," } NR==var1 { print $var2 }' "$ARCHIVO" >> Column_$i
    done
done

paste -d',' `ls -mv Column_* | sed 's/,//g'` >> $ARCHIVO
brasofilo
Another.Chemist
2

If you only want to grab a single (comma delimited) line $N out of a file and turn it into a column:

head -$N file | tail -1 | tr ',' '\n'
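For example (a self-contained sketch using printf in place of a file):

$ printf 'a,b\nc,d\ne,f\n' | head -3 | tail -1 | tr ',' '\n'
e
f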
tripleee
2

Some *nix standard util one-liners, no temp files needed. NB: the OP wanted an efficient fix (i.e. faster), and the top answers are usually faster than this answer. These one-liners are for those who like *nix software tools, for whatever reason. In rare cases (e.g. scarce IO & memory), these snippets can actually be faster than some of the top answers.

Call the input file foo.

  1. If we know foo has four columns:

    for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done
    
  2. If we don't know how many columns foo has:

    n=$(head -n 1 foo | wc -w)
    for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done
    

    xargs has a size limit and would therefore do incomplete work with a long file. The size limit is system dependent, e.g.:

    { timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
    

    Maximum length of command we could actually use: 2088944

  3. tr & echo:

    for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo; done
    

    ...or if the number of columns is unknown:

    n=$(head -n 1 foo | wc -w)
    for f in $(seq 1 $n); do 
        cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
    done
    
  4. Using set, which like xargs, has similar command line size based limitations:

    for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done
    
agc
  • 4
    Those would all be orders of magnitude slower than an awk or perl solution and fragile. Read http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice. – Ed Morton Apr 10 '16 at 15:25
  • @EdMorton, thanks, qualifed intro of my answer to address your speed concerns. Re "fragile": not **3)**, and nor the others when the programmer *knows* the data is safe for a given technique; and isn't POSIX compatible shell code a more stable standard than **perl**? – agc Apr 10 '16 at 18:17
  • sorry, idk much about perl. In this case the tool to use would be `awk`. `cut`, `head`, `echo`, etc. are no more POSIX compatible shell code than an `awk` script is - they all are standard on every UNIX installation. There's simply no reason to use a set of tools that in combination require you to be careful about the contents of your input file and the directory you execute the script from when you can just use awk and the end result is faster as well as more robust. – Ed Morton Apr 10 '16 at 19:12
  • Please, I'm not anti-**awk**, but conditions vary. Reason #1: `for f in cut head xargs seq awk ; do wc -c $(which $f) ; done` When storage is too slow or IO is too low, bigger interpreters make things worse no matter how good they'd be under more ideal circumstances. Reason #2: **awk**, (or most any language), also suffers from a steeper learning curve than a small util designed to do one thing well. When run-time is cheaper than coder man hours, easy coding with "software tools" saves money. – agc Apr 10 '16 at 20:30
2

Another awk solution; the input size is limited by the amount of memory you have.

awk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
    END{ for (i in RtoC) print RtoC[i] }' infile

This joins all values at the same field position together and, in the END block, prints the result: the first row becomes the first column, the second row becomes the second column, etc. It will output:

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
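Note that for (i in RtoC) does not guarantee numeric order in every awk; a sketch of the same idea with an explicit ordered loop (it assumes every line has the same number of fields):

awk '{ for (i=1; i<=NF; i++) RtoC[i] = (RtoC[i] ? RtoC[i] FS $i : $i) }
    END{ for (i=1; i<=NF; i++) print RtoC[i] }' infile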
αғsнιη
1

Here is a Bash one-liner that is based on simply converting each line to a column and paste-ing them together:

echo '' > tmp1;  \
cat m.txt | while read l ; \
            do    paste tmp1 <(echo $l | tr -s ' ' \\n) > tmp2; \
                  cp tmp2 tmp1; \
            done; \
cat tmp1

m.txt:

0 1 2
4 5 6
7 8 9
10 11 12
  1. creates tmp1 file so it's not empty.

  2. reads each line and transforms it into a column using tr

  3. pastes the new column to the tmp1 file

  4. copies result back into tmp1.

PS: I really wanted to use io-descriptors but couldn't get them to work.

agc
kirill_igum
  • Make sure to set an alarm clock if you're going to execute that on a large file. Read http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice to understand some, but not all, of the problems with that approach. – Ed Morton Apr 10 '16 at 15:46
  • that's one line? – johny why Sep 01 '21 at 16:04
1
#!/bin/bash

aline="$(head -n 1 file.txt)"
set -- $aline
colNum=$#

#set -x
while read line; do
  set -- $line
  for i in $(seq $colNum); do
    eval col$i="\"\$col$i \$$i\""
  done
done < file.txt

for i in $(seq $colNum); do
  eval echo \${col$i}
done

Another version, using set and eval.

Dyno Fu
  • Read http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice to understand some, but not all, of the problems with that solution. – Ed Morton Apr 10 '16 at 15:43
1

Another bash variant

$ cat file 
XXXX    col1    col2    col3
row1    0       1       2
row2    3       4       5
row3    6       7       8
row4    9       10      11

Script

#!/bin/bash

I=0
while read line; do
    i=0
    for item in $line; { printf -v A$I[$i] $item; ((i++)); }
    ((I++))
done < file
indexes=$(seq 0 $i)

for i in $indexes; {
    J=0
    while ((J<I)); do
        arr="A$J[$i]"
        printf "${!arr}\t"
        ((J++))
    done
    echo
}

Output

$ ./test 
XXXX    row1    row2    row3    row4    
col1    0       3       6       9   
col2    1       4       7       10  
col3    2       5       8       11
Ivan
1

Here is a super simple Ruby solution:

ruby -lane 'BEGIN{lines=[]} 
    lines<<$F 
END{lines.transpose.each{|row| puts row.join("\t")}}' file

Or you can control the splitting yourself rather than relying on the command line switches of -lan:

ruby -e '$<.read.
            split(/\R+/).
            map(&:split).
            transpose.each{|row| puts row.join("\t")}' file

Or you can use Ruby's CSV module:

ruby -r csv -e '
tbl=CSV.parse($<.read, **{:headers=>false, :col_sep=>" "}) # or use :col_sep=>"\t" for tab columns 
tbl.transpose.each{|row| puts row.join("\t")}
' file 

Any of these prints:

X   row1    row2    row3    row4
column1 0   3   6   9
column2 1   4   7   10
column3 2   5   8   11
dawg
0

Here's a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog74's awk and slightly slower than Stephan202's thinly wrapped Python on my machine for repeated "Hello world" input lines. Unfortunately GHC's support for passing command-line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.

transpose :: [[a]] -> [[a]]
transpose = foldr (zipWith (:)) (repeat [])

main :: IO ()
main = interact $ unlines . map unwords . transpose . map words . lines
stelleg
0

An awk solution that stores the whole array in memory:

    awk '$0!~/^$/{    i++;
                  split($0,arr,FS);
                  for (j in arr) {
                      out[i,j]=arr[j];
                      if (maxr<j){ maxr=j}     # max number of output rows.
                  }
            }
    END {
        maxc=i                 # max number of output columns.
        for     (j=1; j<=maxr; j++) {
            for (i=1; i<=maxc; i++) {
                printf( "%s:", out[i,j])
            }
            printf( "%s\n","" )
        }
    }' infile

But we may "walk" the file as many times as output rows are needed:

#!/bin/bash
maxf="$(awk '{if (mf<NF); mf=NF}; END{print mf}' infile)"
rowcount=maxf
for (( i=1; i<=rowcount; i++ )); do
    awk -v i="$i" -F " " '{printf("%s\t ", $i)}' infile
    echo
done

Which, for a low count of output rows, is faster than the previous code.

0

A one-liner using R...

  cat file | Rscript -e "d <- read.table(file('stdin'), sep=' ', row.names=1, header=T); write.table(t(d), file=stdout(), quote=F, col.names=NA) "
dputhier
0

I've used the two scripts below to do similar operations before. The first is in awk, which is a lot faster than the second, which is in "pure" bash. You might be able to adapt them to your own application.

awk '
{
    for (i = 1; i <= NF; i++) {
        s[i] = s[i]?s[i] FS $i:$i
    }
}
END {
    for (i in s) {
        print s[i]
    }
}' file.txt
The "pure" bash version:

declare -a arr

while IFS= read -r line
do
    i=0
    for word in $line
    do
        [[ ${arr[$i]} ]] && arr[$i]="${arr[$i]} $word" || arr[$i]=$word
        ((i++))
    done
done < file.txt

for ((i=0; i < ${#arr[@]}; i++))
do
    echo ${arr[i]}
done
Sam
0

A simple four-line answer, kept readable:

col="$(head -1 file.txt | wc -w)"
for i in $(seq 1 $col); do
    awk '{ print $'$i' }' file.txt | paste -s -d "\t"
done
Penny Liu
0
for i in $(seq $(head -n1 file.txt | tr ' ' '\n' | wc -l))
do
  cut -d' ' -f"$i" file.txt | paste -s -d' ' -
done

or

seq $(head -n1 file.txt | tr " " "\n" | wc -l) | xargs -I{} sh -c 'cut -d" " -f"{}" file.txt | paste -s -d" " -'
Jiangge Zhang
0

If you want to just directly print the transposed form, and you know for sure the column/field count (i.e. NF) is consistent across the input file, then just do it all in one shot:

 {m,g}awk '
  END {  _____ = ", "
            __ = gsub("\n", "&")
          ____ = NF
      
      for(___ ^= _<_; ___<=__; ___++) {
         for(_ = ___; _<=____; _+=__) {
              
         printf("%s%s",$_, ____<(_+__)\
                       ? "\n" : _____) }  } }' FS='[[:space:]]+' RS=

|

0, 3, 6, 9
1, 4, 7, 10
2, 5, 8, 11

As for concerns regarding overloading system resources with too many fields at once, I've gotten mawk2 to allocate 1.331 billion fields, using a non-default OFS, in 10.5 secs :

( date | mawk2 '$!NF=NF=(191)^+(OFS="4")'; )  

 7.82s user 2.30s system 96% cpu 10.499 total
     1  1,330,863,361

By comparison, mawk 1.3.4 was only slightly behind at 11.03 secs

( date | mawk '$!NF=NF=(191)^+(OFS="4")'; )  

8.37s user 2.28s system 96% cpu 11.031 total
     1  1,330,863,361

… but the amazing gawk 5.1.1 took 6x the time to allocate only 39% of that amount :

( date | LC_ALL=C gawk -be '$-_=NF=(151)^+(OFS="4")'; )  

14.59s user 34.98s system 77% cpu 1:04.38 total

     1  519,885,601
RARE Kpop Manifesto