
The file contains 2,000,000 rows; each row contains 208 columns separated by commas, like this:

0.0863314058048,0.0208767447842,0.03358010485,0.0,1.0,0.0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0

The program reads this file into a NumPy ndarray. I expected it to consume about 2,000,000 * 208 * 8 B ≈ 3.2 GB of memory. However, when the program reads the file, it consumes about 20 GB of memory.

Why does my program consume so much more memory than I expected?
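For reference, this is roughly how the data is being read (a minimal sketch assuming np.loadtxt, as mentioned in the comments below; the file name data.txt is a placeholder):

import numpy as np

# 'data.txt' is a placeholder for the actual file path.
# np.loadtxt parses the whole file (via temporary Python lists)
# before building the final float64 array.
data = np.loadtxt('data.txt', delimiter=',')
print(data.shape, data.nbytes)  # final array: nrows * ncols * 8 bytes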

  • Can you show the exact line of code that reads the data from the file? It is hard to answer if we have to guess. – Bas Swinckels Oct 26 '14 at 06:04
  • @BasSwinckels Thank you, I use np.loadtxt() to read the data. Saullo Castro has pointed out the problem and explained it roughly. – 祝方泽 Oct 26 '14 at 09:43

2 Answers


I'm using NumPy 1.9.0, and the memory inefficiency of np.loadtxt() and np.genfromtxt() seems to be directly related to the fact that they store the data in temporary Python lists (a simplified illustration of this pattern follows the links below):

  • see here for np.loadtxt()
  • and here for np.genfromtxt()
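As a rough illustration (not the actual NumPy source), a list-based reader does something like the following; while parsing, every value is a separate Python float object and every row a separate Python list, which costs far more than the 8 bytes per value of the final array:

import numpy as np

def read_with_lists(path, delimiter=','):
    # Simplified sketch of the loadtxt/genfromtxt pattern:
    # accumulate rows in Python lists, convert to an array at the end.
    rows = []
    with open(path) as f:
        for line in f:
            rows.append([float(x) for x in line.split(delimiter)])
    # Peak memory holds the whole list-of-lists of Python floats
    # *plus* the final ndarray built from it.
    return np.array(rows)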

If you know the shape of your array beforehand, you can write a file reader that consumes an amount of memory very close to the theoretical amount (3.2 GB in this case) by storing the data directly into an array with the corresponding dtype:

import numpy as np

def read_large_txt(path, delimiter=None, dtype=None):
    with open(path) as f:
        # First pass: count the rows
        nrows = sum(1 for line in f)
        f.seek(0)
        # Peek at the first line to count the columns
        ncols = len(next(f).split(delimiter))
        # Allocate the full array once, with the final dtype
        out = np.empty((nrows, ncols), dtype=dtype)
        f.seek(0)
        # Second pass: parse each line straight into the preallocated array
        for i, line in enumerate(f):
            out[i] = line.split(delimiter)
    return out
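For the file in the question, a call like the following should stay close to the theoretical 3.2 GB (data.txt is a placeholder for the actual path):

data = read_large_txt('data.txt', delimiter=',', dtype=np.float64)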
Saullo G. P. Castro
  • Having seen the sample row, there could be a vast memory saving if a sparse matrix were used instead, couldn't there? (see the sketch after this comment thread) – user3666197 Oct 26 '14 at 08:11
  • @user3666197 surely yes, but that would require a more complex reader function.... – Saullo G. P. Castro Oct 26 '14 at 08:13
  • Sure, the OP's issue seems to be memory-bound, so this was a suggestion to trade a potentially blocking memory-bound issue for CPU-bound effort, making both the input itself and the further processing feasible on even larger dataSETs (my gut sense tells me the OP is not seeking a one-liner or a few SLOCs, but a feasible approach to input and process similar batches of data with numpy comfort, and so will pay the cost of a somewhat smarter input pre-processor). – user3666197 Oct 26 '14 at 08:16
  • @user3666197 I've tested here and the problem with `np.loadtxt()` and also `np.genfromtxt()` is not knowing the shape, forced to use temporary lists and `list.append()` (see [here](https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L859) and [here](https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L1640)) – Saullo G. P. Castro Oct 26 '14 at 08:21
  • That was not in question, Saullo; the input-processor related issue is addressed in your Answer. Excuse my remark, it just touched on a more efficient matrix representation for the dataSET. – user3666197 Oct 26 '14 at 08:26
  • @Saullo Castro Thanks for your explanation and your code; I tried your code and found it consumes about 3.2 GB. – 祝方泽 Oct 26 '14 at 09:32
  • @user3666197 I want to train a machine learning classifier using the data from this file, so the data in memory cannot be compacted. Thank you for your careful observation. – 祝方泽 Oct 26 '14 at 09:38
  • @祝方泽 I thought you were trying to import a training dataSET; still, ML tools can work with sparse matrices, without wasting dtype=np.float96 on empty/zero cells. You may benefit from this once your dataSETs grow a bit bigger. – user3666197 Oct 26 '14 at 09:46
  • @user3666197 Thank you, your advice indeed gives me a new perspective. Thank you very much. – 祝方泽 Oct 26 '14 at 10:49
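A minimal sketch of the sparse-matrix idea from the comments above, assuming SciPy is available and that the many zero columns make it worthwhile (the file name is a placeholder):

import numpy as np
from scipy import sparse

def read_large_txt_sparse(path, delimiter=',', dtype=np.float64):
    # Fill a LIL matrix row by row (cheap to build),
    # then convert to CSR for compact storage and fast math.
    with open(path) as f:
        nrows = sum(1 for line in f)
        f.seek(0)
        ncols = len(next(f).split(delimiter))
        mat = sparse.lil_matrix((nrows, ncols), dtype=dtype)
        f.seek(0)
        for i, line in enumerate(f):
            row = np.array(line.split(delimiter), dtype=dtype)
            nz = row.nonzero()[0]
            if nz.size:
                mat[i, nz] = row[nz]
    return mat.tocsr()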

I think you should try pandas to handle big data (text files). pandas is like Excel in Python, and it internally uses NumPy to represent the data.
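For example, a sketch of reading the file with pandas (data.txt and the float32 downcast are assumptions, not taken from the question):

import numpy as np
import pandas as pd

# read_csv is typically faster and leaner than np.loadtxt;
# float32 halves the memory if full float64 precision is not needed.
df = pd.read_csv('data.txt', header=None, dtype=np.float32)
X = df.values  # the underlying NumPy array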

HDF5 files are another method to save big data into a binary file.
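A minimal sketch of the HDF5 route with pandas' HDFStore (requires PyTables; file names are placeholders):

import pandas as pd

# Convert the text file once into a binary HDF5 store;
# later loads skip the slow text parsing entirely.
df = pd.read_csv('data.txt', header=None)
df.to_hdf('data.h5', 'data', mode='w')

df = pd.read_hdf('data.h5', 'data')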

This question gives some ideas about how to handle big files: "Large data" workflows using pandas

Haridas N