
I am trying to process data stored in a text file, test.dat, which looks like this:

-1411.85  2.6888   -2.09945   -0.495947   0.835799   0.215353   0.695579   
-1411.72  2.82683   -0.135555   0.928033   -0.196493   -0.183131   -0.865999   
-1412.53  0.379297   -1.00048   -0.654541   -0.0906588   0.401206   0.44239   
-1409.59  -0.0794765   -2.68794   -0.84847   0.931357   -0.31156   0.552622   
-1401.63  -0.0235102   -1.05206   0.065747   -0.106863   -0.177157   -0.549252   
....
....

The file, however, is several GB in size and I would very much like to read it in small blocks of rows. I would like to use numpy's loadtxt function, as this converts everything quickly to a numpy array. However, I have not managed to do this so far: the function only seems to offer a selection of columns, like here:

data = np.loadtxt("test.dat", delimiter='  ', skiprows=1, usecols=range(1,7))

Any ideas how to achieve this? If it is not possible with loadtxt, are there any other options available in Python?

  • loadtxt's fname argument can be a generator, so to read small blocks of rows you can use a file-read generator such as the one shown in nosklo's answer at http://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python, but adapted to read a small number of lines instead of bytes (a sketch follows these comments). –  Aug 15 '15 at 17:10
  • See also: http://stackoverflow.com/a/27962976/901925 - `Fastest way to read every n-th row with numpy's genfromtxt` – hpaulj Aug 15 '15 at 17:46
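A minimal sketch of the generator approach suggested in the first comment (the helper name line_block and the 1000-row block size are assumptions, not from the thread; very old numpy versions may require the generator to yield bytes rather than str):

import itertools
import numpy as np

def line_block(path, start, stop):
    # lazily yield lines [start, stop) of the file
    with open(path) as fh:
        for line in itertools.islice(fh, start, stop):
            yield line

# hypothetical: read only the first 1000 data rows, columns 1-6
data = np.loadtxt(line_block('test.dat', 0, 1000), usecols=range(1, 7))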

3 Answers


If you can use pandas, that would be easier:

In [2]: import pandas as pd

In [3]: df = pd.read_table('test.dat', delimiter='  ', skiprows=1, usecols=range(1,7), nrows=3, header=None)

In [4]: df.values
Out[4]:
array([[ 2.82683  , -0.135555 ,  0.928033 , -0.196493 , -0.183131 ,
        -0.865999 ],
       [ 0.379297 , -1.00048  , -0.654541 , -0.0906588,  0.401206 ,
         0.44239  ],
       [-0.0794765, -2.68794  , -0.84847  ,  0.931357 , -0.31156  ,
         0.552622 ]])

Edit

If you want to read the file in blocks of, say, k rows at a time, you can specify chunksize. For example,

reader = pd.read_table('test.dat', delimiter='  ', usecols=range(1,7), header=None, chunksize=2)
for chunk in reader:
    print(chunk.values)

Out:

[[ 2.6888   -2.09945  -0.495947  0.835799  0.215353  0.695579]
 [ 2.82683  -0.135555  0.928033 -0.196493 -0.183131 -0.865999]]
[[ 0.379297  -1.00048   -0.654541  -0.0906588  0.401206   0.44239  ]
 [-0.0794765 -2.68794   -0.84847    0.931357  -0.31156    0.552622 ]]
[[-0.0235102 -1.05206    0.065747  -0.106863  -0.177157  -0.549252 ]]

You will have to decide how to store the chunks inside the for-loop; one possibility is sketched below. Note that in this case reader is a TextFileReader, not a DataFrame, so you can iterate over it lazily.
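For instance, if you want each block as its own ndarray (as the comments below ask for), you can collect chunk.values as you go; a minimal sketch, reusing the same read_table arguments as above:

import numpy as np
import pandas as pd

reader = pd.read_table('test.dat', delimiter='  ', usecols=range(1, 7),
                       header=None, chunksize=3)
blocks = [chunk.values for chunk in reader]   # list of arrays, up to 3 rows each

# if you eventually need everything in a single array:
full = np.vstack(blocks)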

You can read the pandas documentation on iterating through files chunk by chunk for more details.

yangjie
  • I do not see how I would, for instance, read the first three rows and then the next three, and so on. Could you explain that as well please? Thanks for your efforts! –  Aug 15 '15 at 17:39
  • You mean read the first three into an ndarray then the next three into another ndarray and so on? – yangjie Aug 15 '15 at 17:47
  • Yes, that is what I need! –  Aug 15 '15 at 17:53
  • @andi That's not quite clearly stated in your question though. I didn't understand it at once either. – Eli Korvigo Aug 15 '15 at 17:59
  • probably with a ```try: read_table(...) except EOFError: break``` block nested in an infinite while loop – dermen Aug 15 '15 at 18:57

hpaulj pointed me in the right direction in his comment.

The following code works perfectly for me:

import numpy as np
import itertools
with open('test.dat') as f_in:
    x = np.genfromtxt(itertools.islice(f_in, 1, 12, None), dtype=float)
    print(x[0, :])
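If you need to walk through the whole file block by block this way, one option (a sketch; the 3-row block size is only an example) is to call islice repeatedly on the same file handle until it is exhausted:

import itertools
import numpy as np

block_size = 3  # example block size
with open('test.dat') as f_in:
    while True:
        lines = list(itertools.islice(f_in, block_size))
        if not lines:
            break                      # end of file reached
        block = np.genfromtxt(lines, dtype=float)
        block = np.atleast_2d(block)   # a single-row block comes back 1-D
        # process `block` here
        print(block.shape)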

Thanks a lot!


You might want to use an itertools recipe.

from itertools import zip_longest  # izip_longest on Python 2
import numpy as np


def grouper(n, iterable, fillvalue=None):
    # collect the iterable into fixed-length chunks, padding the last one with fillvalue
    args = [iter(iterable)] * n
    return zip_longest(fillvalue=fillvalue, *args)


def lazy_reader(fp, nlines, sep, skiprows, usecols):
    # yield one ndarray of up to nlines rows at a time; loadtxt skips the "" padding lines
    with open(fp) as inp:
        for chunk in grouper(nlines, inp, ""):
            yield np.loadtxt(chunk, delimiter=sep, skiprows=skiprows, usecols=usecols)

The function returns a generator of arrays.

lazy_data = lazy_reader(...)
next(lazy_data)  # this will give you the next chunk
# or you can iterate 
for chunk in lazy_data:
    ...
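A concrete call could look like this (the 1000-row block size and the column selection are just example values; sep=None makes loadtxt split on any whitespace, which suits the data above):

lazy_data = lazy_reader('test.dat', nlines=1000, sep=None,
                        skiprows=0, usecols=range(1, 7))
first_block = next(lazy_data)   # ndarray with up to 1000 rows and 6 columns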
Eli Korvigo