9

I have a large array of numbers written in a CSV file and need to load only a slice of that array. Conceptually I want to call np.genfromtxt() and then row-slice the resulting array, but

  1. the file is so large that it may not fit in RAM;
  2. the number of relevant rows might be small, so there is no need to parse every line.

MATLAB has the function textscan() that can take a file descriptor and read only a chunk of the file. Is there anything like that in NumPy?

For now, I defined the following function that reads only the lines that satisfy the given condition:

import numpy as np

def genfromtxt_cond(fname, cond=(lambda line: True)):
    # parse only the lines that satisfy cond (assumes whitespace-separated floats)
    res = []
    with open(fname) as f:
        for line in f:
            if cond(line):
                res.append([float(s) for s in line.split()])
    return np.array(res, dtype=np.float64)

There are several problems with this solution:

  • not general: it supports only floats, while genfromtxt detects the types, which may vary from column to column; it also lacks handling of missing values, converters, line skipping, etc.;
  • not efficient: when the condition is complex, every line may effectively be parsed twice, and the data structure and read buffering used may be suboptimal;
  • requires writing code.

Is there a standard function that implements filtering, or some counterpart of MATLAB’s textscan?

tzelleke
Roman Shapovalov
  • Why not just ``yield line`` or ``yield line.split()`` instead of building ``res``? Then it's agnostic about the data in ``line``. – sotapme Feb 01 '13 at 12:03
  • @sotapme How is that helpful? Do you mean a generator is faster than `append`? I need the filtered np.array() in the end anyway. – Roman Shapovalov Feb 01 '13 at 12:34
  • You said that yours only supported ``float`` so as you'd already generalised ``cond`` I thought yielding the line would allow you to use the same ``genfromtxt_cond`` irrespective of the line data. I was thinking of code reuse and not performance. – sotapme Feb 01 '13 at 14:29

3 Answers

16

I can think of two approaches that provide some of the functionality you are asking for:

  1. To read a file in chunks, in strides of n lines, etc.:
    You can pass a generator to numpy.genfromtxt as well as to numpy.loadtxt. This way you can load a large dataset from a text file memory-efficiently while retaining all the convenient parsing features of the two functions.

  2. To read data only from lines that match a criterion that can be expressed as a regex:
    You can use numpy.fromregex with a regular expression that precisely defines which tokens from a given line in the input file should be loaded. Lines not matching the pattern will be ignored.

To illustrate the two approaches, I'm going to use an example from my research context.
I often need to load files with the following structure:

6
 generated by VMD
  CM         5.420501        3.880814        6.988216
  HM1        5.645992        2.839786        7.044024
  HM2        5.707437        4.336298        7.926170
  HM3        4.279596        4.059821        7.029471
  OD1        3.587806        6.069084        8.018103
  OD2        4.504519        4.977242        9.709150
6
 generated by VMD
  CM         5.421396        3.878586        6.989128
  HM1        5.639769        2.841884        7.045364
  HM2        5.707584        4.343513        7.928119
  HM3        4.277448        4.057222        7.022429
  OD1        3.588119        6.069086        8.017814

These files can be huge (GBs) and I'm only interested in the numerical data. All data blocks have the same size -- 6 in this example -- and they are always separated by two lines. So the stride of the blocks is 8.

Using the first approach:

First I'm going to define a generator that filters out the undesired lines:

def filter_lines(f, stride):
    # drop the first two lines (atom count and comment) of every stride-sized block
    for i, line in enumerate(f):
        if i % stride and (i - 1) % stride:
            yield line

Then I open the file, create a filter_lines-generator (here I need to know the stride), and pass that generator to genfromtxt:

with open(fname) as f:
    data = np.genfromtxt(filter_lines(f, 8),
                         dtype='f',
                         usecols=(1, 2, 3))

This works like a breeze. Note that I'm able to use usecols to get rid of the first column of the data. In the same way, you could use all the other features of genfromtxt -- detecting the types, varying types from column to column, missing values, converters, etc.

In this example data.shape was (204000, 3) while the original file consisted of 272000 lines.

Here the generator is used to filter homogeneously strided lines, but one can likewise imagine it filtering out inhomogeneous blocks of lines based on (simple) criteria.
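For instance, here is a minimal sketch of such a predicate-based filter (the label set and column names are my own illustration for the file above, not part of any standard API): it keeps only the OD1/OD2 lines and lets genfromtxt detect a different type for each column:

def filter_labeled(f, labels):
    # keep only data lines whose first token is one of the given labels
    for line in f:
        parts = line.split()
        if parts and parts[0] in labels:
            yield line

with open(fname) as f:
    oxygens = np.genfromtxt(filter_labeled(f, ('OD1', 'OD2')),
                            dtype=None,            # detect a type per column
                            names=('atom', 'x', 'y', 'z'))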

Using the second approach:

Here's the regexp I'm going to use:

regexp = r'\s+\w+' + r'\s+([-.0-9]+)' * 3 + r'\s*\n'

Groups -- i.e. inside () -- define the tokens to be extracted from a given line. Next, fromregex does the job and ignores lines not matching the pattern:

data = np.fromregex(fname, regexp, dtype='f')

The result is exactly the same as in the first approach.
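As a quick check, a sketch that simply reruns both calls from above (renaming the results so both stay in scope):

with open(fname) as f:
    data_gen = np.genfromtxt(filter_lines(f, 8), dtype='f', usecols=(1, 2, 3))
data_re = np.fromregex(fname, regexp, dtype='f')

assert data_gen.shape == data_re.shape
assert np.allclose(data_gen, data_re)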

tzelleke
  • This is fantastic, thanks. So if I understand correctly, the main benefit of using the generator instead of the direct file path is that we save space storing the data. I'm assuming it's unavoidable reading in the data since that's the only way we can know if we want to filter it out or not. Will this speed loading files up significantly? Or will it actually be slower since we need to first filter and then pass that iterable to genfromtxt? – Lucas Apr 14 '14 at 15:19
  • I understand this answer is from back in 2013, so I am not sure if things have changed. At the moment I can't find anything about generators in the documentation. Best guess: converters, but they seem to provide limited functionality compared to this answer's. Does this still work? Thanks – divmermarlav Jun 11 '16 at 20:04
1

If you pass a list of types (the format specification), wrap the parsing in a try block, and use yield to turn the function into a generator, we should be able to replicate textscan().

def genfromtext(fname, formatTypes):
    with open(fname, 'r') as f:
        for line in f:
            try:
                cells = line.split(',')
                r = []
                for fmt, cell in zip(formatTypes, cells):
                    r.append(fmt(cell))
            except ValueError:
                continue  # skip this line silently: a conversion failed
            yield r

Edit: I forgot the except block. It runs okay now and you can use genfromtext as a generator like so (using a random CSV log I have sitting around):

>>> a = genfromtext('log.txt', [str, str, str, int])
>>> a.next()
['10.10.9.45', ' 2013/01/17 16:29:26', '00:00:36', 0]
>>> a.next()
['10.10.9.45', ' 2013/01/17 16:22:20', '00:08:14', 0]
>>> a.next()
['10.10.9.45', ' 2013/01/17 16:31:05', '00:00:11', 3]

I should probably note that I'm using zip to pair up the comma-split cells with formatTypes; zip stops as soon as the shorter of the two sequences runs out of items, so we can iterate over them together without a loop that depends on len(line) or something like that.
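A quick illustration of that truncation behaviour, using the first line of the log above plus a made-up extra field:

>>> formatTypes = [str, str, str, int]
>>> cells = '10.10.9.45, 2013/01/17 16:29:26,00:00:36,0,extra'.split(',')
>>> [fmt(cell) for fmt, cell in zip(formatTypes, cells)]
['10.10.9.45', ' 2013/01/17 16:29:26', '00:00:36', 0]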

m.brindley
  • Your answer is relevant, but what I really wanted was to solve my problem with available library functions as efficient and powerful as genfromtxt. A custom implementation, either the filtering version (in the question) or the `textscan` version (in your answer), suffers from the several problems I listed above. – Roman Shapovalov Feb 01 '13 at 15:46
  • What optimisations does textscan offer above this besides being a standard function? It's relatively simple to add a block to skip leading lines and yield only N lines from the file, conversion bails out of a line as soon as reasonably possible (you could even remove conversion and return the original string to save on `append`s), and the whole thing is extremely lazy so memory use is minimal. You could potentially use linecache to improve access speed at the cost of memory usage but I can't comment much on its use. – m.brindley Feb 01 '13 at 21:20
  • I accept your answer, since you answered the question, indeed. Although I wanted to know if there is a library implementation, where one does not need to implement all the features `np.genfromtxt` has. Most likely, there is not one, so your answer is the best fit. – Roman Shapovalov Feb 06 '13 at 10:19
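For what it's worth, here is one way the skip-and-limit idea from the comment above could look, built on the same lazy generator with itertools.islice (a sketch only; the function name and parameters are mine, not a standard API):

import itertools

def genfromtext_chunk(fname, formatTypes, skip=0, limit=None):
    # Parse only lines skip .. skip+limit, roughly like textscan reading a chunk.
    # The leading lines are still read from disk, but never converted.
    stop = None if limit is None else skip + limit
    with open(fname, 'r') as f:
        for line in itertools.islice(f, skip, stop):
            try:
                yield [fmt(cell) for fmt, cell in zip(formatTypes, line.split(','))]
            except ValueError:
                continue  # skip lines that do not match the format

For example, list(genfromtext_chunk('log.txt', [str, str, str, int], skip=100, limit=10)) converts at most ten lines of the log.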
0

Trying to demonstrate my comment to the OP.

def fread(name, cond):
    with open(name) as f:
        for line in f:
            if cond(line):
                yield line.split()

def a_genfromtxt_cond(fname, cond=(lambda line: True)):
    """Seems to work without needing to convert to float explicitly."""
    return np.array(list(fread(fname, cond)), dtype=np.float64)

def b_genfromtxt_cond(fname, cond=(lambda line: True)):
    r = [[int(float(s)) for s in row] for row in fread(fname, cond)]
    return np.array(r, dtype=int)


a = a_genfromtxt_cond("tar.data")
print a
aa = b_genfromtxt_cond("tar.data")
print aa

Output

[[ 1.   2.3  4.5]
 [ 4.7  9.2  6.7]
 [ 4.7  1.8  4.3]]
[[1 2 4]
 [4 9 6]
 [4 1 4]]
sotapme
  • Also see `numpy.fromiter` if you're worried about reducing memory usage (i.e. no intermediate list). It's for 1D arrays, but you can make it work with 2D without too much trouble. Not to plug my own answer, but see the last example here: http://stackoverflow.com/a/8964779/325565 – Joe Kington Feb 01 '13 at 23:44
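To make the fromiter idea concrete, here is a minimal sketch with a structured dtype, one field per column (the field names and the three-column layout are assumptions matching the example above):

import numpy as np

def iter_rows(fname, cond=lambda line: True):
    # yield one tuple of floats per line that passes the filter
    with open(fname) as f:
        for line in f:
            if cond(line):
                yield tuple(float(s) for s in line.split())

# fromiter consumes the generator directly, so no intermediate list is built;
# it needs a 1-D dtype, hence one structured field per column
row_dtype = np.dtype([('x', 'f8'), ('y', 'f8'), ('z', 'f8')])
data = np.fromiter(iter_rows('tar.data'), dtype=row_dtype)

# columns are available by name, e.g. data['x'], or can be copied
# into a plain 2-D array if needed
xyz = np.column_stack([data['x'], data['y'], data['z']])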