For one of my data analysis pipelines, I end up generating a lot of individual CSV files. I would like to transpose them, concatenate them, and transpose them again. However, the amount of data is large, so loading it all into memory is not practical.
-
Are you using [numpy](http://numpy.scipy.org/)? How much data is it? – Keith Aug 23 '11 at 06:06
-
Just to be sure, what do you mean by concatenate? If you were to concatenate A and B, would you want the first line to be a11,a12,..,a1n,b11,b12,..,b1n? – Rusty Rob Aug 23 '11 at 06:06
-
So you want to concatenate them line-wise? – rocksportrocker Aug 23 '11 at 07:38
-
Do your columns/fields have fixed widths/sizes? – tommy.carstensen Apr 07 '13 at 23:27
5 Answers
Another short and Pythonic solution. I used this to transpose CSVs that are 15,000,000 x 12,000. It's fast and pure Python. Everything else you need to do is trivial, and this is definitely the hardest part.
Github link: https://gist.github.com/arose13/facfb91b609d453f3ad840417faa503a
def transpose_csv_out_of_core(csv_path, output_csv_path='transposed.csv', delimiter=','):
    """
    On my laptop it can transpose at ~375,000 lines a sec
    :param csv_path:
    :param output_csv_path:
    :param delimiter:
    :return:
    """
    import csv
    transposed_iterator = zip(*csv.reader(open(csv_path)))
    with open(output_csv_path, 'w') as out:
        for row in transposed_iterator:
            out.write(delimiter.join(row) + '\n')
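For reference, a minimal usage sketch; the file names here are placeholders, not part of the original answer:
# Hypothetical paths; writes the transpose of input.csv to input_transposed.csv
transpose_csv_out_of_core('input.csv', output_csv_path='input_transposed.csv', delimiter=',')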

Concatenating the rows of data from two csv files (if that's what you meant) without loading all of both of them into memory is a relatively easy and fast operation: Just read in a single row from each one, join those together, and then write that to an output file, repeating until all the input data is exhausted.
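A minimal Python 3-style sketch of that row-by-row join (the file names are placeholders; under Python 2 you would use itertools.izip instead of zip to keep it lazy):
import csv

# Placeholder file names; zip() pairs rows lazily in Python 3,
# so only one row from each input is in memory at a time.
with open('a.csv', newline='') as fa, open('b.csv', newline='') as fb, \
        open('joined.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for row_a, row_b in zip(csv.reader(fa), csv.reader(fb)):
        writer.writerow(row_a + row_b)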
Transposing the data in a csv file without reading the entire thing into memory is intrinsically going to be a much slower process, since it requires the entire input file to be reread in multiple passes, each time extracting the data from just one of its columns. If that's an acceptable (or necessary) trade-off, here's basically how it would be done using the built-in csv module:
import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

with open(output_filename, 'wb') as outputf:
    writer = csv.writer(outputf)
    with open(input_filename, 'rb') as inputf:
        # determine number of columns in input file by counting those in its first row
        # number of cols in input file determines number of rows in output file
        numcols = len(csv.reader(inputf).next())
        # read entire input file multiple times, extracting one column from each row
        for col_index in xrange(numcols):
            # write all of the column's data as a single row of the output file
            inputf.seek(0)  # rewind file for each pass
            writer.writerow(tuple(row[col_index] for row in csv.reader(inputf)))

-
If the columns/fields have fixed widths/sizes, then he can use seek instead of looping over the file multiple times. – tommy.carstensen Apr 07 '13 at 23:27
-
@tommy.carstensen: For the special-case of fixed-width fields, it's not clear that doing a seek for each field or row would be faster than just doing a single one to rewind the whole file at the beginning. Also, for fixed-width lines, it seems likely it would be faster to adapt my [answer](http://stackoverflow.com/a/4915359/355230) to another question which is based on using the `struct` module for parsing the file rather than the `csv` module. _However_, the OP said nothing about fixed-width fields. – martineau Apr 08 '13 at 02:23
-
I get this error when trying to use your csv solution above: Traceback (most recent call last): File "python.py", line 16, in writer.writerow(tuple(row[col_index] for row in csv.reader(inputf))) _csv.Error: field larger than field limit (131072) – tommy.carstensen Apr 08 '13 at 09:38
-
@tommy.carstensen: I tested my code before posting it to verify it worked. Sounds like you may have a malformed input file that is missing a delimiter between fields or something. You should be able to print each item the generator expression is returning and determine the location within the file where this occurs. – martineau Apr 08 '13 at 12:10
-
It works great on smaller files and it is lightning fast (fastest of 8 different methods I tried and one of them being a C solution). It just doesn't work, when I exceed 2^17 (131072) rows/lines in the input file (i.e. 131072 columns/fields in the output file). – tommy.carstensen Apr 08 '13 at 16:25
-
@tommy.carstensen: Must be a limitation of the `csv` module. It's often fairly easy to forgo its use for input by instead using other built-ins, like regular `file` objects and `str.split()` etc, to accomplish the same thing. – martineau Apr 08 '13 at 18:38
Here is a solution that works when the fields have fixed widths:
import sys
import os

def main():
    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in)+'.transposed'
    with open(path_in) as fd_in:
        line = fd_in.readline()
        l = line.split()
        field_width = int(len(line)/len(l))
        file_size = os.path.getsize(path_in)
        cols2 = rows1 = line_count = int(file_size/len(line))
        rows2 = cols1 = len(l)
    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row in range(rows2):
            for col in range(cols2-1):
                fd_in.seek(col*len(line)+row*field_width)
                fd_out.write('{} '.format(fd_in.read(field_width-1)))
            fd_in.seek((col+1)*len(line)+row*field_width)
            fd_out.write('{}\n'.format(fd_in.read(field_width-1)))
    return

if __name__ == '__main__':
    main()
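The key assumption in the script above is that every line of the input has exactly the same byte length, so the position of any cell can be computed directly (line index times line length, plus field index times field width) and read with a single seek(), keeping only one field in memory at a time. If the field widths or line lengths vary, those computed offsets drift off the field boundaries, which is what the second variant below works around by remembering a per-line file offset and scanning forward one character at a time.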
Here is a solution that works if the fields don't have fixed widths:
import sys
import os

def main():
    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in)+'.transposed'
    separator = ' '
    d_seek = {}
    with open(path_in) as fd_in:
        i = 0
        while True:
            tell = fd_in.tell()
            if fd_in.readline() == '':
                break
            d_seek[i] = tell
            i += 1
    cols2 = rows1 = i
    with open(path_in) as fd_in:
        line = fd_in.readline()
        rows2 = cols1 = len(line.split(separator))
        del line
    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row2 in range(rows2):
            for row1 in range(rows1):
                fd_in.seek(d_seek[row1])
                j = 0
                s = ''
                while True:
                    char = fd_in.read(1)
                    j += 1
                    if char == separator or char == '\n':
                        break
                    s += char
                d_seek[row1] += len(s)+1
                if row1+1 < rows1:
                    fd_out.write('{} '.format(s))
                else:
                    fd_out.write('{}\n'.format(s))
    return

if __name__ == '__main__':
    main()

The following code simulates reading from two csv files. The first one has the two rows
[1,2,1]
[3,4,1]
and the second one has
[7,8,2]
[9,10,2]
The result is the two rows
[1,2,1,7,8,2]
[3,4,1,9,10,2]
Is that what you wanted?
def source1():
    for i in [[1, 2, 1], [3, 4, 1]]:
        yield i

def source2():
    for i in [[7, 8, 2], [9, 10, 2]]:
        yield i

def join(*sources):
    while True:
        row = []
        for s in sources:
            row.extend(s.next())
        yield row

for row in join(source1(), source2()):
    print row
In your case, you have to replace the calls to source1() and source2() with csv file iterators, roughly as sketched below.
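A rough sketch of that substitution, reusing the join() generator above (Python 2 style to match the answer; the file names are placeholders):
import csv

f1 = open('first.csv', 'rb')
f2 = open('second.csv', 'rb')

# csv.reader objects are iterators, so they can be passed straight to join();
# iteration stops as soon as either file runs out of rows.
for row in join(csv.reader(f1), csv.reader(f2)):
    print row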

Use a generator, e.g.
from itertools import izip

file1 = open("test", "r")
file2 = open("test2", "r")

def lazy(file):
    for line in file:
        # do something with the line
        yield line

for lines in izip(lazy(file1), lazy(file2)):
    print lines
http://wiki.python.org/moin/Generators
Edit: You can use the csv module to parse it. Also, I realized that the readlines() method of file objects isn't lazy, so you have to use the `for line in file` pattern.

-
You could `import csv` and then do f1,f2=csv.reader(open('f1.txt','rb'),delimiter=',',quotechar='"'),csv.reader(open('f2.txt','rb'),delimiter=',',quotechar='"') – Rusty Rob Aug 23 '11 at 06:18
-
Yeah, I don't really use CSV files much, so I don't know about any libraries that python has for them. – Wes Aug 23 '11 at 06:31
-
The csv module is simple to use. It does the splitting for you. E.g. if your csv file is using quote characters with line1: firstname,lastname,scores and then line2: bob,smith,"98,95,23,99" then for line2 it would return ['bob','smith','98,95,23,99'] – Rusty Rob Aug 23 '11 at 06:37
-
Also if file1 has a field/column name in common with file2, when you concatenate you will have duplicate column names. (this will likely cause an error upon insertion into a mysql table for example) – Rusty Rob Aug 23 '11 at 06:39
-
You do not need the lazy function here, as file() follows the iterator protocol. – rocksportrocker Aug 23 '11 at 11:34
-
-1: Sorry, but rocksportrocker is right, using `lazy()` is completely unnecessary. – martineau Mar 23 '13 at 19:35
-
That wasn't the point of the answer. The point was to explain it. Anyway you would probably want yield if you wanted to alter the data in some way in a function. – Wes Mar 25 '13 at 14:51
-
You also didn't address the issue of transposing that data in the files in your answer, which the OP also said he needed to do. – martineau Apr 08 '13 at 02:34