csvtk
-t
uses tab as delimiter and -d
specifies delimiter:
$ seq 4|paste - -|csvtk -t transpose
1 3
2 4
$ seq 4|paste -d, - -|csvtk -d, transpose
1,3
2,4
-l
(--lazy-quotes
) allows double quotes within a field which is not surrounded by double quotes, but I don't think there's any way to disable CSV-style escaping for double quotes in the output:
$ csvtk -t transpose<<<$'aa"bb\t2\n3\t4'
[ERRO] parse error on line 1, column 3: bare " in non-quoted-field
$ csvtk -lt transpose<<<$'aa"bb\t2\n3\t4'
"aa""bb" 3
2 4
Another caveat is that -l
doesn't prevent removing double quotes around fields that don't need to be quoted in CSV:
$ csvtk -lt transpose<<<$'"1"\t2\n3\t4'
1 3
2 4
awk
Gawk version which uses arrays of arrays:
tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' "${1+FS=$1}";}
Plain awk version which uses multidimensional arrays (this took about twice as long to run in my benchmark):
tp(){ awk '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}' "${1+FS=$1}";}
macOS comes with a version of Brian Kerningham's nawk
from 2007 which doesn't support arrays of arrays.
To use space as a field separator without collapsing sequences of multiple spaces, use FS='[ ]'
.
rs
rs
is a BSD utility which also comes with macOS, but it should be available from package managers on other platforms. It is named after the reshape function in APL.
Use sequences of spaces and tabs as column separator:
rs -T
Use tab as column separator:
rs -c -C -T
Use comma as column separator:
rs -c, -C, -T
-c
changes the input column separator and -C
changes the output column separator. A lone -c
or -C
sets the separator to tab. -T
transposes rows and columns.
Do not use -t
instead of -T
, because it selects the number of output columns so that the output lines fill the width of the display (which is 80 characters by default but which can be changed with -w
).
When an output column separator is specified using -C
, an extra column separator character is added to the end of each row, but you can remove it with sed
:
$ seq 4|paste -d, - -|rs -c, -C, -T
1,3,
2,4,
$ seq 4|paste -d, - -|rs -c, -C, -T|sed s/.\$//
1,3
2,4
rs -T
determines the number of columns based on the number of columns on the first row, so it produces the wrong result when the first line ends with one or more empty columns:
$ rs -c, -C, -T<<<$'1,\n3,4'
1,3,4,
R
The t
function transposes a matrix or a dataframe:
Rscript -e 'write.table(t(read.table("stdin",sep=",",quote="",comment.char="")),sep=",",quote=F,col.names=F,row.names=F)'
If you replace Rscript -e
with R -e
, then it echoes the code that is being run to STDOUT, and it also results in the error ignoring SIGPIPE signal
if the R command is followed by a command like head -n1
which exits before it has read the whole STDIN.
quote=""
can be removed if the input doesn't contain double quotes or single quotes, and comment.char=""
can be removed if the input doesn't contain lines that start with a hash character.
For large input, data.table::fread
is usually faster than read.table
and data.table::fwrite
is faster than write.table
:
$ seq 1e6|awk 'ORS=NR%1e3?FS:RS'>a
$ time Rscript --no-init-file -e 'write.table(t(read.table("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m1.061s
user 0m0.983s
sys 0m0.074s
$ time Rscript --no-init-file -e 'write.table(t(data.table::fread("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m0.599s
user 0m0.535s
sys 0m0.048s
$ time Rscript --no-init-file -e 'data.table::fwrite(t(data.table::fread("a")),sep=" ",col.names=F)'>/dev/null
x being coerced from class: matrix to data.table
real 0m0.375s
user 0m0.296s
sys 0m0.073s
jq
tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}
jq -R .
prints each input line as a JSON string literal, -s
(--slurp
) creates an array for the input lines after parsing each line as JSON, and -r
(--raw-output
) outputs the contents of strings instead of JSON string literals. The /
operator is overloaded to split strings.
Ruby
ruby -e'STDIN.map{|x|x.chomp.split(",",-1)}.transpose.each{|x|puts x*","}'
The -1
argument to split
disables discarding empty fields at the end:
$ ruby -e'p"a,,".split(",")'
["a"]
$ ruby -e'p"a,,".split(",",-1)'
["a", "", ""]
Function form:
$ tp(){ ruby -e's=ARGV[0];STDIN.map{|x|x.chomp.split(s==" "?/ /:s,-1)}.transpose.each{|x|puts x*s}' -- "${1-$'\t'}";}
$ seq 4|paste -d, - -|tp ,
1,3
2,4
The function above uses s==" "?/ /:s
because when the argument to the split
function is a single space, it enables awk-like special behavior where strings are split based on contiguous runs of spaces and tabs:
$ ruby -e'p" a \tb ".split(" ",-1)'
["a", "b", ""]
$ ruby -e'p" a \tb ".split(/ /,-1)'
["", "a", "", "\tb", ""]
C++
Compile with cpp -O3 -o tp tp.cpp
:
#include<iostream>
#include<vector>
#include<sstream>
using namespace std;
int main(int argc,char**argv){
vector<vector<string> >table;
vector<string>row;
string line,field;
char sep=argc==1?'\t':argv[1][0];
while(getline(cin,line)){
row.clear();
stringstream str(line);
while(getline(str,field,sep))row.push_back(field);
table.push_back(row);
}
int rows=table.size(),cols=table[0].size();
for(int i=0;i<cols;i++){
cout<<table[0][i];
for(int j=1;j<rows;j++)cout<<sep<<table[j][i];
cout<<endl;
}
}
Benchmark
csvtk
was the fastest and jq
was the slowest:
$ seq 1e6|awk 'ORS=NR%1e3?"\t":"\n"'>test
$ TIMEFORMAT=%R
$ time ./tp<test>/dev/null # first run of C++ program is slower
0.895
$ time ./tp<test>/dev/null
0.520
$ time csvtk -t transpose<test>/dev/null
0.142
$ time rs -c -C -T<test|sed $'s/\t$//'>/dev/null
0.587
$ time gawk -F\\t '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' test>/dev/null
1.119
$ time awk -F\\t '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}'<test>/dev/null
1.858
$ time jq -R . test|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]'>/dev/null
3.604
$ time ruby -e'STDIN.map{|x|x.chomp.split("\t",-1)}.transpose.each{|x|puts x*"\t"}'<test>/dev/null
0.492