I have a data table with ~74 million rows that I loaded with blaze:
from blaze import CSV, data
csv = CSV('train.csv')  # lazy handle to the file; nothing is read yet
t = data(csv)           # wrap it as a blaze expression
It has these fields: A, B, C, D, E, F, G.
Since the table is so large, how can I efficiently pull out the rows that match specific criteria? For example, I want the rows where A == 4, B == 8, and E == 10. Is there a way to parallelize the lookup, e.g. with threads or multiprocessing?
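The single-pass version I know how to write is just a boolean filter (a minimal sketch using blaze's expression syntax, with the column names from above):

matches = t[(t.A == 4) & (t.B == 8) & (t.E == 10)]
print(matches)  # blaze only evaluates the expression when results are requested

This works, but it scans the whole file sequentially, which is what I'd like to avoid.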
By parallel programming I mean, for example, that one worker scans rows 1 to 100000 for matches, a second worker scans rows 100001 to 200000, and so on.
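Here is roughly the shape of what I'm imagining, sketched with pandas chunks plus multiprocessing instead of blaze (the file name and column names match my setup above, and the chunk size matches the split I described; whether this actually beats a single sequential scan is part of my question):

import multiprocessing as mp
import pandas as pd

CHUNK = 100000  # rows per task, matching the split described above

def scan_chunk(chunk):
    # each worker filters its own slice of the file independently
    return chunk[(chunk['A'] == 4) & (chunk['B'] == 8) & (chunk['E'] == 10)]

if __name__ == '__main__':
    reader = pd.read_csv('train.csv', chunksize=CHUNK)
    with mp.Pool() as pool:
        # imap feeds chunks to workers one at a time instead of
        # materializing all 74M rows in memory at once
        parts = pool.imap(scan_chunk, reader)
        result = pd.concat(parts, ignore_index=True)
    print(result)

One caveat I can already see: the CSV parsing still happens serially in the parent process here, so only the filtering itself is parallelized. Is there a better way to split the work?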