I have some very large txt files (about 1.5 GB) which I want to load into Python as an array. The problem is that in this data a comma is used as the decimal separator. For smaller files I came up with this solution:

import numpy as np

# Load everything as strings first, since the values can't be
# parsed as floats while they still contain commas
data = np.loadtxt(file, dtype=np.str, delimiter='\t', skiprows=1)
data = np.char.replace(data, ',', '.')   # comma -> decimal point
data = np.char.replace(data, '\'', '')   # strip quotes from the bytes repr
data = np.char.replace(data, 'b', '').astype(np.float64)  # strip the b prefix, then convert

But for the large files Python runs into a MemoryError. Is there any other, more memory-efficient way to load this data?

kilojoules
Greg.P

2 Answers

The problem with np.loadtxt(file, dtype=np.str, delimiter='\t', skiprows=1) is that it stores Python string objects instead of float64 values, which is very memory-inefficient. You can use pandas read_table

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html#pandas.read_table

to read your file and set decimal=',' to change the default decimal separator. This lets you read and convert the strings into floats in one step. After loading, use df.values on the pandas dataframe to get a numpy array. If it's still too large for your memory, read it in chunks

http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

If that still isn't enough, try np.float32, which halves the memory footprint again.
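A minimal sketch of that approach, with a small in-memory sample standing in for the real 1.5 GB file (the column names and values here are made up for illustration; read_csv with sep='\t' is equivalent to read_table):

```python
import io
import numpy as np
import pandas as pd

# Small in-memory stand-in for the real tab-separated file.
sample = io.StringIO("x\ty\n1,5\t2,25\n3,0\t4,75\n")

# decimal=',' makes pandas parse '1,5' as 1.5 directly;
# dtype=np.float32 halves the memory of the default float64.
df = pd.read_csv(sample, sep="\t", decimal=",", dtype=np.float32)
data = df.values  # plain numpy array
print(data.tolist())  # [[1.5, 2.25], [3.0, 4.75]]

# For files too large to load at once, process row groups in chunks:
sample.seek(0)
for chunk in pd.read_csv(sample, sep="\t", decimal=",", chunksize=1):
    pass  # replace with per-chunk processing
```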

talonmies
Dennis Sakva

You should try parsing it yourself, iterating over the lines (implicitly using a generator, so the whole file is never read into memory). Also, for data of that size, I would use Python's standard array module, which uses about as much memory as a C array: one value stored right next to the other (a numpy array is also very memory-efficient, though).

import array

def convert(s):
    # Convert a string with a comma decimal separator to float
    s = s.strip().replace(',', '.')
    return float(s)

data = array.array('d')  # an array of type double (64-bit floats)

with open(filename, 'r') as f:
    for l in f:
        strnumbers = l.split('\t')
        # Generator expression: convert fields one at a time,
        # skipping empty fields (e.g. from a trailing newline)
        data.extend(convert(s) for s in strnumbers if s.strip())

I'm sure similar code (with a similar memory footprint) can be written replacing array.array with a numpy array, especially if you need a two-dimensional array.
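One way to sketch that numpy variant, assuming the number of columns is known up front (the 2-column sample below is made up for illustration): feed the same kind of generator into np.fromiter, which builds the array incrementally without an intermediate list, then reshape.

```python
import io
import numpy as np

# In-memory stand-in for the real tab-separated file.
sample = io.StringIO("1,5\t2,25\n3,0\t4,75\n")

def values(f):
    # Yield floats one at a time, so the file is never
    # held in memory as a collection of strings.
    for line in f:
        for s in line.split('\t'):
            s = s.strip()
            if s:
                yield float(s.replace(',', '.'))

flat = np.fromiter(values(sample), dtype=np.float64)
data = flat.reshape(-1, 2)  # 2 columns assumed here
print(data.tolist())  # [[1.5, 2.25], [3.0, 4.75]]
```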

eguaio