I have a file containing a large table of numbers, roughly 300 MB in size. I want to read this in Python.
Data looks like this:
-200 1 11097.4 16414.2 1
-200 1 11197.4 16414.8 1
-200 1 11297.4 16415.4 1
-200 1 11397.4 16416 1
-200 1 11497.4 16416.5 1
-200 1 11597.4 16417.1 1
-200 1 11697.4 16417.7 1
Python code looks like this:
summary = []
with open(filename) as f:
    nrow, ncol = [int(x) for x in next(f).split()]
    for k in range(2):
        rr = []
        for i in range(nrow + 1):
            row = []
            for j in range(ncol + 1):
                a = next(f).split()
                row.append([int(a[0]), int(a[1]), float(a[2]), float(a[4])])
            rr.append(row)
        summary.append(rr)
This is very slow; it takes about 60 seconds to read the file, and I'd like to get that down to under 10 seconds. What's the simplest way to speed it up?
I am perfectly happy to change the data file format, if it helps.
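To make the question concrete, here is the kind of thing I imagine an answer might look like (an untested sketch; `read_table` and `convert_to_npy` are names I made up, and it assumes `numpy` is an acceptable dependency):

```python
import numpy as np

def read_table(filename):
    # Reads the same format as my loop: a "nrow ncol" header line,
    # then 5 whitespace-separated columns per row.
    # usecols skips column 3, matching what my inner loop keeps.
    with open(filename) as f:
        nrow, ncol = (int(x) for x in f.readline().split())
        data = np.loadtxt(f, usecols=(0, 1, 2, 4))
    return nrow, ncol, data

def convert_to_npy(filename):
    # Since the file format is negotiable: one-time conversion to
    # numpy's binary .npy format, which should load much faster
    # than any text parser (read back with np.load).
    nrow, ncol, data = read_table(filename)
    np.save(filename + ".npy", data)
```

I don't know whether this is actually fast enough at 300 MB, or whether there is a simpler route.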