I have some very large txt files (about 1.5 GB) which I want to load into Python as an array. The problem is that in this data a comma is used as the decimal separator. For smaller files I came up with this solution:

import numpy as np

# Load everything as strings first, since the values can't be
# parsed as floats while they still contain commas
data = np.loadtxt(file, dtype=np.str, delimiter='\t', skiprows=1)
data = np.char.replace(data, ',', '.')   # comma -> decimal point
data = np.char.replace(data, '\'', '')   # strip quotes from the bytes repr
data = np.char.replace(data, 'b', '').astype(np.float64)  # strip the b prefix, then convert

But for the large files Python runs into a MemoryError. Is there any other, more memory-efficient way to load this data?

kilojoules
Greg.P

2 Answers

The problem with np.loadtxt(file, dtype=np.str, delimiter='\t', skiprows=1) is that it stores Python string objects instead of float64 values, which is very memory-inefficient. You can use pandas read_table

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html#pandas.read_table

to read your file and set decimal=',' to change the default decimal separator. This lets you read and convert the strings into floats in one step. After loading, use df.values on the pandas dataframe to get a numpy array. If it's still too large for your memory, read it in chunks

http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

If that still isn't enough, try np.float32, which halves the memory footprint again.
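A minimal sketch of that approach, with a small in-memory sample standing in for the real 1.5 GB file (the column names and values here are made up for illustration; read_csv with sep='\t' is equivalent to read_table):

```python
import io
import numpy as np
import pandas as pd

# Small in-memory stand-in for the real tab-separated file.
sample = io.StringIO("x\ty\n1,5\t2,25\n3,0\t4,75\n")

# decimal=',' makes pandas parse '1,5' as 1.5 directly;
# dtype=np.float32 halves the memory of the default float64.
df = pd.read_csv(sample, sep="\t", decimal=",", dtype=np.float32)
data = df.values  # plain numpy array
print(data.tolist())  # [[1.5, 2.25], [3.0, 4.75]]

# For files too large to load at once, process row groups in chunks:
sample.seek(0)
for chunk in pd.read_csv(sample, sep="\t", decimal=",", chunksize=1):
    pass  # replace with per-chunk processing
```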

talonmies
Dennis Sakva

You should try parsing it yourself, iterating over the lines (implicitly using a generator, so the whole file is never read into memory). Also, for data of that size, I would use Python's standard array module, which uses about as much memory as a C array: one value stored right next to the other (a numpy array is also very memory-efficient, though).

import array

def convert(s):
    # Convert a string with a comma decimal separator to float
    s = s.strip().replace(',', '.')
    return float(s)

data = array.array('d')  # an array of type double (64-bit floats)

with open(filename, 'r') as f:
    for l in f:
        strnumbers = l.split('\t')
        # Generator expression: convert fields one at a time,
        # skipping empty fields (e.g. from a trailing newline)
        data.extend(convert(s) for s in strnumbers if s.strip())

I'm sure similar code (with a similar memory footprint) can be written replacing array.array with a numpy array, especially if you need a two-dimensional array.
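One way to sketch that numpy variant, assuming the number of columns is known up front (the 2-column sample below is made up for illustration): feed the same kind of generator into np.fromiter, which builds the array incrementally without an intermediate list, then reshape.

```python
import io
import numpy as np

# In-memory stand-in for the real tab-separated file.
sample = io.StringIO("1,5\t2,25\n3,0\t4,75\n")

def values(f):
    # Yield floats one at a time, so the file is never
    # held in memory as a collection of strings.
    for line in f:
        for s in line.split('\t'):
            s = s.strip()
            if s:
                yield float(s.replace(',', '.'))

flat = np.fromiter(values(sample), dtype=np.float64)
data = flat.reshape(-1, 2)  # 2 columns assumed here
print(data.tolist())  # [[1.5, 2.25], [3.0, 4.75]]
```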

eguaio