
I have Python code that is supposed to read large files into a dictionary in memory and do some operations on them. What puzzles me is that it runs out of memory in only one case: when the values in the file are integers...

The structure of my file is like this:

string value_1 .... value_n

The files I have vary in size from 2G to 40G, and I have 50G of memory to read the file into. When I have something like this: string 0.001334 0.001473 -0.001277 -0.001093 0.000456 0.001007 0.000314 ... with n=100 and the number of rows equal to 10M, I am able to read it into memory relatively fast; the file size is about 10G. However, when I have string 4 -2 3 1 1 1 ... with the same dimension (n=100) and the same number of rows, I'm not able to read it into memory.

matrix = {}  # word -> {column index: float value}
for line in f:
    tokens = line.strip().split()
    if len(tokens) <= 5:  # ignore the word2vec header line
        continue
    word = tokens[0]
    number_of_columns = len(tokens) - 1

    features = {}
    for dim, val in enumerate(tokens[1:]):
        features[dim] = float(val)
    matrix[word] = features

This results in Killed in the second case, while it works in the first case.
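For reference, a rough way to see why this layout is so expensive is to estimate what one row's features dict costs and multiply by the number of rows. This is only an illustrative sketch (the constants below are made up, and object sizes vary by Python version), not a measurement on the actual files:

import sys

# One row as built by the loop above: a dict of 100 int keys -> float values.
features = {dim: float(dim) * 0.001 for dim in range(100)}

# Dict structure plus its keys and values. Rough estimate only: CPython
# caches small ints, so the real per-row figure is somewhat lower.
per_row = sys.getsizeof(features)
per_row += sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in features.items())

print(per_row)                # several KB per row on CPython
print(per_row * 10_000_000)   # times 10M rows: far more than 50G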

Nick

1 Answer


I know this does not answer the question directly, but it probably offers a better solution to the underlying problem:

May I suggest you use pandas for this kind of work? It seems a lot more appropriate for what you're trying to do: http://pandas.pydata.org/index.html

import pandas as pd

# skiprows=1 skips the word2vec header line; header=None because the data
# rows themselves carry no column names.
df = pd.read_csv('file.txt', sep=' ', skiprows=1, header=None)

Then do all your manipulations on the resulting DataFrame. Pandas is a package designed specifically to handle and process large datasets; it has tons of useful features you will probably end up needing if you're dealing with big data.
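For example, a minimal sketch of getting back the word -> values lookup from the original loop (df comes from the read_csv call above; 'some_word' is just a placeholder):

# Make the first column (the string) the index, so rows can be looked up by word.
df = df.set_index(0)

vector = df.loc['some_word']    # analogous to matrix['some_word'] in the question
col_means = df.mean(axis=0)     # column-wise operations run vectorized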

MrE