I have some Python code that is supposed to read large files into a dictionary in memory and do some operations on them. What puzzles me is that it runs out of memory in only one case: when the values in the file are integers.
The structure of my file is like this:
string value_1 ... value_n
The files I have vary in size from 2 GB to 40 GB, and I have 50 GB of memory to read them into. When a file looks like this:
string 0.001334 0.001473 -0.001277 -0.001093 0.000456 0.001007 0.000314 ...
with n = 100 and 10M rows, I can read it into memory relatively quickly; the file size is about 10 GB. However, when the file looks like this:
string 4 -2 3 1 1 1 ...
with the same dimension (n = 100) and the same number of rows, I'm not able to read it into memory.
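To make the two cases easy to reproduce, here is a rough sketch of how small files in the two formats could be generated (the paths, the row count, and the value ranges below are placeholders, not the real data):

import random

# Write a small sample of each format; the real files have ~10M rows and n = 100.
# "floats_sample.txt" and "ints_sample.txt" are placeholder paths.
with open("floats_sample.txt", "w") as f_float, open("ints_sample.txt", "w") as f_int:
    for i in range(1000):
        word = f"word{i}"
        float_values = " ".join(f"{random.uniform(-0.01, 0.01):.6f}" for _ in range(100))
        int_values = " ".join(str(random.randint(-5, 5)) for _ in range(100))
        f_float.write(f"{word} {float_values}\n")
        f_int.write(f"{word} {int_values}\n")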
The code I use to read the file into the dictionary is:

matrix = {}
with open(file_path) as f:  # file_path is the path to the input file
    for line in f:
        tokens = line.strip().split()
        if len(tokens) <= 5:  # ignore the word2vec header line
            continue
        word = tokens[0]
        number_of_columns = len(tokens) - 1
        features = {}
        for dim, val in enumerate(tokens[1:]):
            features[dim] = float(val)
        matrix[word] = features
This results in the process getting Killed in the second case, while it works fine in the first.
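To pin down where the memory goes, here is a minimal sketch of how the growth could be measured with tracemalloc on a truncated copy of either file ("sample.txt" below is a placeholder path):

import tracemalloc

# Measure how much memory the dictionary-of-dictionaries holds after
# loading a truncated copy of the file; "sample.txt" is a placeholder path.
tracemalloc.start()

matrix = {}
with open("sample.txt") as f:
    for line in f:
        tokens = line.strip().split()
        if len(tokens) <= 5:  # skip the word2vec header line
            continue
        matrix[tokens[0]] = {dim: float(val) for dim, val in enumerate(tokens[1:])}

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e9:.2f} GB, peak: {peak / 1e9:.2f} GB")
tracemalloc.stop()

Running this on equally sized prefixes of the float file and the integer file should show whether the difference comes from the dictionaries themselves or from something else in the pipeline.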