
I have Python code that is supposed to read large files into a dictionary in memory and do some operations on them. What puzzles me is that it runs out of memory in only one case: when the values in the file are integers...

The structure of my file is like this:

string value_1 .... value_n

The files I have vary in size from 2G to 40G, and I have 50G of memory to read the file into. When I have something like this: string 0.001334 0.001473 -0.001277 -0.001093 0.000456 0.001007 0.000314 ... with n=100 and the number of rows equal to 10M, I am able to read it into memory relatively fast; the file size is about 10G. However, when I have string 4 -2 3 1 1 1 ... with the same dimension (n=100) and the same number of rows, I'm not able to read it into memory.

matrix = {}  # word -> {column index: float value}
for line in f:
    tokens = line.strip().split()
    if len(tokens) <= 5:  # ignore the word2vec header line
        continue
    word = tokens[0]
    number_of_columns = len(tokens) - 1

    features = {}
    for dim, val in enumerate(tokens[1:]):
        features[dim] = float(val)
    matrix[word] = features

This results in Killed in the second case, while it works in the first case.
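For reference, a rough way to see why this layout is so expensive is to estimate what one row's features dict costs and multiply by the number of rows. This is only an illustrative sketch (the constants below are made up, and object sizes vary by Python version), not a measurement on the actual files:

import sys

# One row as built by the loop above: a dict of 100 int keys -> float values.
features = {dim: float(dim) * 0.001 for dim in range(100)}

# Dict structure plus its keys and values. Rough estimate only: CPython
# caches small ints, so the real per-row figure is somewhat lower.
per_row = sys.getsizeof(features)
per_row += sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in features.items())

print(per_row)                # several KB per row on CPython
print(per_row * 10_000_000)   # times 10M rows: far more than 50G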

Nick

1 Answer


I know this does not answer the question directly, but it probably offers a better solution to the underlying problem:

May I suggest you use pandas for this kind of work? It seems a lot more appropriate for what you're trying to do: http://pandas.pydata.org/index.html

import pandas as pd

# skiprows=1 skips the word2vec header line; header=None because the data
# rows themselves carry no column names.
df = pd.read_csv('file.txt', sep=' ', skiprows=1, header=None)

Then do all your manipulations on the resulting DataFrame. Pandas is a package designed specifically to handle and process large datasets; it has tons of useful features you will probably end up needing if you're dealing with big data.
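For example, a minimal sketch of getting back the word -> values lookup from the original loop (df comes from the read_csv call above; 'some_word' is just a placeholder):

# Make the first column (the string) the index, so rows can be looked up by word.
df = df.set_index(0)

vector = df.loc['some_word']    # analogous to matrix['some_word'] in the question
col_means = df.mean(axis=0)     # column-wise operations run vectorized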

MrE