
My main goal is to calculate the median (by columns) of a HUGE matrix of floats. Example:

import numpy

a = numpy.array([[1, 1, 3, 2, 7], [4, 5, 8, 2, 3], [1, 6, 9, 3, 2]])

numpy.median(a, axis=0)

Out[38]: array([ 1.,  5.,  8.,  2.,  3.])

The matrix is too big to fit in memory (it's ~5 terabytes), so I keep it in a CSV file. I want to iterate over each column and calculate its median.

Is there any way to get a column iterator without loading the whole file into memory?

Any other ideas for calculating the median of such a matrix would be welcome too. Thank you!

dbaron

4 Answers


If you can fit each column into memory (which you seem to imply you can), then this should work:

import csv

def columns(file_name):
    # Read the first row to find out how many columns the file has.
    with open(file_name) as file:
        data = csv.reader(file)
        n_columns = len(next(data))
    # For each column, re-read the file and pull that column's field
    # out of every row, converting it from a string to a float.
    for column in range(n_columns):
        with open(file_name) as file:
            data = csv.reader(file)
            yield [float(row[column]) for row in data]

This works by first finding out how many columns we have, then looping over the file once per column, taking that column's item out of each row (converted to a float, since the csv module yields strings). This means that, at most, we hold one column plus one row in memory at a time. It's a pretty simple generator. Note that we have to keep reopening the file, as we exhaust the iterator each time we loop through it.
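
You could then feed each column to numpy one at a time, for example (a minimal sketch; `data.csv` is a placeholder file name):

import numpy

# Compute the median of each column without ever holding two columns at once.
medians = [numpy.median(column) for column in columns('data.csv')]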

Gareth Latty
  • If reopening the file is a problem, just move the `with` outside the for loop and do `file.seek(0)` inside. – Mu Mind Sep 22 '12 at 22:46
  • @MuMind That's a good alternative to reopening again and again (and also would mean you could pass a file object in case you didn't have a filename for whatever reason). – Gareth Latty Sep 22 '12 at 22:47
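
Along the lines of Mu Mind's suggestion, the generator could open the file once and rewind it instead of reopening (a sketch, not from the original answer):

import csv

def columns(file_name):
    with open(file_name) as file:
        # Count the columns from the first row.
        n_columns = len(next(csv.reader(file)))
        for column in range(n_columns):
            file.seek(0)  # rewind instead of reopening the file
            yield [float(row[column]) for row in csv.reader(file)]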

I would do this by initializing N empty files, one for each column. Then read the matrix one row at a time and send each column entry to the correct file. Once you've processed the whole matrix, go back and calculate the median of each file sequentially.

This basically uses the filesystem to do a matrix transpose. Once transposed, calculating the median of each row is easy.
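
A minimal sketch of that idea (the file names and helper are hypothetical; it assumes one file handle per column can stay open, and that any single column, but not the whole matrix, fits in memory):

import csv

def transpose_to_files(file_name, n_columns):
    # Pass 1: append each row's fields to one file per column.
    outputs = [open('col_%d.txt' % i, 'w') for i in range(n_columns)]
    with open(file_name) as f:
        for row in csv.reader(f):
            for value, out in zip(row, outputs):
                out.write(value + '\n')
    for out in outputs:
        out.close()

def median_of_file(file_name):
    # Pass 2: each column file is one "row" of the transpose.
    with open(file_name) as f:
        values = sorted(float(line) for line in f)
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2.0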

Keith Randall
  • thank you for your response! my matrix size is ~5 terabytes, I'm afraid I don't have enough storage to do this :( – dbaron Sep 22 '12 at 22:11

There's probably no direct way to do what you're asking with a CSV file (unless I've misunderstood you). The problem is that there's no meaningful sense in which any file has "columns" unless the file is specially designed to have fixed-width rows, and CSV files generally aren't designed that way. On disk, they're nothing more than a giant string:

>>> import csv
>>> with open('foo.csv', 'w', newline='') as f:
...     writer = csv.writer(f)
...     for i in range(0, 100, 10):
...         writer.writerow(range(i, i + 10))
... 
>>> with open('foo.csv', 'r', newline='') as f:
...     f.read()
... 
'0,1,2,3,4,5,6,7,8,9\r\n10,11,12,13,14,15,16,17,18,19\r\n20..(output truncated)..

As you can see, the column fields don't line up predictably; the second column's field starts at index 2 in the first row, but in the next row every field is one character wider, throwing off the alignment. It's even worse when field lengths vary. The upshot is that the csv reader has to read the entire file, throwing out the data you don't use. (If you don't mind that, then that's the answer: read the whole file line by line, throwing out the data you won't use.)

If you don't mind wasting some space and know that none of your data will be longer than some fixed width, you could create a file with fixed-width fields, and then you could seek through it using offsets. But then, once you're doing that, you might as well start using a real database. PyTables seems to be the favorite choice of many for storing numpy arrays.
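
For illustration, here's roughly what seeking through such a file could look like (a sketch; the widths and file layout are hypothetical assumptions, not anything from this page):

# Hypothetical layout: every field is exactly FIELD_WIDTH bytes (space-padded)
# and every row is exactly ROW_WIDTH bytes, newline included.
FIELD_WIDTH = 12
N_COLUMNS = 5
ROW_WIDTH = FIELD_WIDTH * N_COLUMNS + 1  # +1 for the newline

def read_column(file_name, column):
    values = []
    with open(file_name, 'rb') as f:
        f.seek(0, 2)                      # jump to the end to measure the file
        n_rows = f.tell() // ROW_WIDTH
        for row in range(n_rows):
            # Jump straight to this row's field for the requested column.
            f.seek(row * ROW_WIDTH + column * FIELD_WIDTH)
            values.append(float(f.read(FIELD_WIDTH)))
    return values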

senderle
  • +1 If you're going to be doing this more than once, CSV is a poor choice of format to keep it in. – Mu Mind Sep 22 '12 at 22:41
  • @senderle DB is my goal. Do you know if `numpy.loadtxt(file_path, usecols=[1,2,3])` will do the trick for now? – dbaron Oct 09 '12 at 19:58
  • @dbaron, it just depends on what you mean by "do the trick." I'm pretty sure that `usecols=[1, 2, 3]` will avoid loading the whole matrix into memory at once, so in that sense, yes. I'm also pretty sure it will _read_ the whole file, line by line, throwing out unused data, so in that sense, no. – senderle Oct 09 '12 at 20:24

You can use a bucket sort to sort each of the columns on disk without reading them all into memory, and then simply pick the middle value (see the sketch below).
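
One way the bucket-sort pass could look for a single column file (a rough sketch; it assumes the values fall in a known range, that any one bucket fits in memory, and the bucket file names are hypothetical):

def median_by_buckets(file_name, n_buckets=100, lo=0.0, hi=1.0):
    # Pass 1: stream the column's values into range-based bucket files on disk.
    buckets = [open('bucket_%d.txt' % i, 'w+') for i in range(n_buckets)]
    counts = [0] * n_buckets
    total = 0
    with open(file_name) as f:
        for line in f:
            value = float(line)
            i = min(int((value - lo) / (hi - lo) * n_buckets), n_buckets - 1)
            buckets[i].write(line)
            counts[i] += 1
            total += 1
    # Pass 2: locate the bucket holding the k-th smallest value (the median),
    # sort only that bucket in memory, and index into it.
    k = total // 2  # for simplicity, take the upper median when total is even
    for bucket, count in zip(buckets, counts):
        if k < count:
            bucket.seek(0)
            median = sorted(float(line) for line in bucket)[k]
            break
        k -= count
    for bucket in buckets:
        bucket.close()
    return median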

Or you can use the UNIX awk and sort commands to split and then sort your columns before you select the median.

Hans Then