2

I have a 3-column text file of about 28 GB. I would like to read it with Python and put its content into a list of 3D tuples. Here's the code I'm using:

f = open(filename)
col1 = [float(l.split()[0]) for l in f]
f.seek(0)
col2 = [float(l.split()[1]) for l in f]
f.seek(0)
col3 = [float(l.split()[2]) for l in f]
f.close()
rowFormat = [col1,col2,col3]
tupleFormat = zip(*rowFormat)
for ele in tupleFormat: 
        ### do something with ele

There's no 'break' in the for loop, so I actually read the whole content of the file. While the script is running, I can see from 'htop' that it takes 156 GB of virtual memory (VIRT column) and almost the same amount of resident memory (RES column). Why is my script using 156 GB when the file is only 28 GB?

dada
  • Even an `int` is an object with a header and takes up more space than you might expect. Maybe you can use [`numpy.loadtxt()`](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html)? – Reti43 Apr 06 '16 at 19:52
  • Why do you read the file three times? – Peter Wood Apr 06 '16 at 19:57
  • Why do you need it all in memory at the same time? – Peter Wood Apr 06 '16 at 19:58
  • Also, floats seem to take 24 bytes of space per instance (check with `sys.getsizeof(float(0))`). – Lauro Moura Apr 06 '16 at 19:58
  • I did expect the process to use more memory than the actual size of my file, but I'm surprised that it uses about 6 times the size of the file! This is annoying because I don't think I have access to clusters with more than 200 GB of memory. – dada Apr 06 '16 at 19:59
  • Maybe using pandas (reading through `pandas.read_csv`?) would be better. – Lauro Moura Apr 06 '16 at 19:59
  • I will try to use 'readlines' along with 'strip' and 'split' hoping my job will require less memory with those commands. – dada Apr 06 '16 at 20:02
  • @LauroMoura Yeah. Between that and the list references for each element, it seems to pretty much add up. – Reti43 Apr 06 '16 at 20:02
  • Would C++ be faster than Python at reading the file? – dada Apr 12 '16 at 02:54

3 Answers

4

Python objects have a lot of overhead, e.g., a reference count and other bookkeeping. That means that a Python float is more than 8 bytes. On my 32-bit Python version, it is:

>>> import sys
>>> print(sys.getsizeof(float(0)))
16

A list has its own overhead and then requires 4 bytes per element to store a reference to that object. So 100 floats in a list actually take up a size of

>>> a = map(float, range(100))
>>> sys.getsizeof(a) + sys.getsizeof(a[0])*len(a)
2036

Now, a numpy array is different. It has a little bit of overhead of its own, but the raw data underneath is stored contiguously, as in C.

>>> import numpy as np
>>> b = np.array(a)
>>> sys.getsizeof(b)
848
>>> b.itemsize    # number of bytes per element
8

So a Python float in a list effectively requires 20 bytes (16 for the object plus a 4-byte reference) compared to 8 for numpy, and 64-bit Python versions require even more.
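
As a quick check on a 64-bit build (the exact figure can vary a little between Python versions):

>>> import sys
>>> sys.getsizeof(float(0))   # float object on a typical 64-bit CPython
24

On top of that, every list slot stores an 8-byte reference, so each parsed value costs roughly 32 bytes before the lists' own growth overhead is counted, and, on Python 2, zip() then builds yet another full-length list of 3-element tuples, which is why the footprint can easily reach several times the size of the text file.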

So really, if you must load A LOT of data into memory, numpy is one way to go. Looking at the way you load the data, I assume it's in text format with 3 floats per row, separated by an arbitrary number of spaces. In that case, you could simply use numpy.genfromtxt():

data = np.genfromtxt(fname, autostrip=True)
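
If you still want the rows as 3-tuples afterwards, you can iterate over the resulting array directly. A minimal sketch along those lines (the file name and the per-row action are placeholders):

import numpy as np

data = np.genfromtxt('test.dat', autostrip=True)   # 2D array, one row per line of the file
for x, y, z in data:                               # each row unpacks into its three columns
    ele = (x, y, z)                                # same shape as your original tupleFormat entries
    print(ele)                                     # replace with your "do something"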

You could also look into more options, e.g., mmap, but I don't know enough about it to say whether it would be more appropriate for you.

Reti43
  • Well, using 'np.loadtxt' made my process use less memory, so I guess it's a solution. But I have a question: would it be faster to read the variable 'tupleFormat' from a 'pickle' representation compared with opening the file and parsing it? I notice that the pickle representation of 'tupleFormat' is at least twice the size of my original file, but I don't know, maybe reading it will still be faster than opening the file? – dada Apr 10 '16 at 16:39
  • @dada What do you want to do with that data? Reading/loading off the hard drive is a slow process. If you need to load/check a lot of values many times, you're better off loading them in memory once unless that is a physical restriction. – Reti43 Apr 13 '16 at 15:50
0

You need to read it line by line, lazily, treating the file object as a generator of lines. Try this:

col1 = []
col2 = []
col3 = []

rowFormat = [col1, col2, col3]

with open('test', 'r') as f:
    for line in f:
        parts = line.split()
        col1.append(float(parts[0]))
        col2.append(float(parts[1]))
        col3.append(float(parts[2]))
        # if possible do something here to start seeing results immediately

tupleFormat = zip(*rowFormat)
for ele in tupleFormat:
    ### do something with ele

You can add your logic in the for loop so you don't wait for the whole process to finish.
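
One caveat: on Python 2, zip() materialises a second, row-long list of tuples at the end, so if you only need to loop over the rows once, itertools.izip() (or the built-in zip() on Python 3, which is already lazy) avoids that extra copy. A rough sketch with stand-in columns:

from itertools import izip   # Python 2; on Python 3 just use the built-in zip()

col1, col2, col3 = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]   # stand-ins for the real columns
for ele in izip(col1, col2, col3):   # yields one 3-tuple at a time
    print(ele)                       # no second full-length list is built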

fips
  • Wouldn't this still read it in one go instead of lazily? I mean, to become a generator, wouldn't he need to `yield` the current value for each line? – Lauro Moura Apr 06 '16 at 20:06
  • See http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python – Lauro Moura Apr 06 '16 at 20:08
  • Why would this reduce the amount of memory used by the process running my script? – dada Apr 06 '16 at 20:09
  • Yes, it will. See the 3rd option in the link above: "If the file is line-based, the file object is already a lazy generator of lines". – fips Apr 06 '16 at 20:10
  • Sorry, just saw you asked why. It will reduce the amount of memory because you will be reading line by line into memory. This is different from what you did, because the list comprehension you used, [x for x in f], iterates over the file and reads a whole column into each variable. And not only that, you are also reading the file 3 times to do it. What I'm suggesting is an improvement, although it seems using numpy - as Reti43 suggested - could be an even better solution. – fips Apr 06 '16 at 20:30
0

Can you get by without storing every tuple? I.e., can "do something" happen as you read in the file? If so, try this:

#!/usr/bin/env python
import fileinput
for line in fileinput.FileInput('test.dat'):
    ele = tuple((float(x) for x in line.strip().split()))
    # Replace 'print' with your "do something".
    # ele is a plain tuple of the floats on this line, built one
    # line at a time, so nothing accumulates in memory.
    print ele

If not, maybe you can save some memory by choosing either the column format or the list-of-tuples format, but not BOTH. For example:

#!/usr/bin/env python
import fileinput
elements = []
for line in fileinput.FileInput('test.dat'):
    elements.append(tuple((float(x) for x in line.strip().split())))

for ele in elements:
    # do something with ele

Brian McFarland