
I have a huge file (around 30 GB); each line contains the coordinates of a point on a 2D surface. I need to load the file into a NumPy array, points = np.empty((0, 2)), and apply scipy.spatial.ConvexHull to it. Since the file is far too large to load into memory at once, I want to load it in batches of N lines, apply scipy.spatial.ConvexHull to that small part, and then load the next N rows. What's an efficient way to do it?
I found out that in Python you can use islice to read N lines of a file, but the problem is that lines_gen is a generator object, which yields the lines of the file one at a time and has to be consumed in a loop, so I am not sure how to convert lines_gen into a NumPy array efficiently.

from itertools import islice
with open(input, 'r') as infile:
    lines_gen = islice(infile, N)
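To make this concrete, here is a naive sketch of what I am trying to do (N, the file name, and what happens to each per-chunk hull are placeholders); the nested list comprehension is exactly the conversion I would like to make more efficient:

import numpy as np
from itertools import islice
from scipy.spatial import ConvexHull

N = 1000000                          # chunk size (placeholder)
with open("points.txt") as infile:   # placeholder file name
    while True:
        chunk = list(islice(infile, N))   # at most N lines; empty list at EOF
        if not chunk:
            break
        # line-by-line conversion -- the part I'd like to make faster
        arr = np.array([[float(x) for x in line.split()] for line in chunk])
        hull = ConvexHull(arr)       # hull of this chunk only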

My input file:

0.989703    1
0   0
0.0102975   0
0.0102975   0
1   1
0.989703    1
1   1
0   0
0.0102975   0
0.989703    1
0.979405    1
0   0
0.020595    0
0.020595    0
1   1
0.979405    1
1   1
0   0
0.020595    0
0.979405    1
0.969108    1
...
...
...
0   0
0.0308924   0
0.0308924   0
1   1
0.969108    1
1   1
0   0
0.0308924   0
0.969108    1
0.95881 1
0   0
Am1rr3zA

4 Answers


With your data, I can read it in 5-line chunks like this (here N = 5):

In [182]: from itertools import islice
with open(input,'r') as infile:
    while True:
        gen = islice(infile,N)
        arr = np.genfromtxt(gen, dtype=None)
        print arr
        if arr.shape[0]<N:
            break
   .....:             
[(0.989703, 1) (0.0, 0) (0.0102975, 0) (0.0102975, 0) (1.0, 1)]
[(0.989703, 1) (1.0, 1) (0.0, 0) (0.0102975, 0) (0.989703, 1)]
[(0.979405, 1) (0.0, 0) (0.020595, 0) (0.020595, 0) (1.0, 1)]
[(0.979405, 1) (1.0, 1) (0.0, 0) (0.020595, 0) (0.979405, 1)]
[(0.969108, 1) (0.0, 0) (0.0308924, 0) (0.0308924, 0) (1.0, 1)]
[(0.969108, 1) (1.0, 1) (0.0, 0) (0.0308924, 0) (0.969108, 1)]
[(0.95881, 1) (0.0, 0)]

The same thing read as one chunk is:

In [183]: with open(input,'r') as infile:
    arr = np.genfromtxt(infile, dtype=None)
   .....:     
In [184]: arr
Out[184]: 
array([(0.989703, 1), (0.0, 0), (0.0102975, 0), (0.0102975, 0), (1.0, 1),
       (0.989703, 1), (1.0, 1), (0.0, 0), (0.0102975, 0), (0.989703, 1),
       (0.979405, 1), (0.0, 0), (0.020595, 0), (0.020595, 0), (1.0, 1),
       (0.979405, 1), (1.0, 1), (0.0, 0), (0.020595, 0), (0.979405, 1),
       (0.969108, 1), (0.0, 0), (0.0308924, 0), (0.0308924, 0), (1.0, 1),
       (0.969108, 1), (1.0, 1), (0.0, 0), (0.0308924, 0), (0.969108, 1),
       (0.95881, 1), (0.0, 0)], 
      dtype=[('f0', '<f8'), ('f1', '<i4')])

(This is in Python 2.7; in 3 there's a byte/string issue I need to work around).
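A possible Python 3 variant of the same loop, sketched under the assumption of NumPy >= 1.14 (whose genfromtxt accepts str input directly); materializing each slice as a list first also avoids calling genfromtxt on an empty generator at the end of the file:

import numpy as np
from itertools import islice

N = 5
with open("points.txt") as infile:       # placeholder file name
    while True:
        lines = list(islice(infile, N))  # at most N text lines
        if not lines:
            break                        # reached end of file
        arr = np.genfromtxt(lines, dtype=None, encoding=None)
        print(arr)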

hpaulj
  • I need it in 2.7; thanks for your answer, let me check if everything works fine for me – Am1rr3zA Mar 16 '15 at 00:08
  • If you need more speed, `pandas` has a faster version of `genfromtxt`. The `np` version is pure Python, processing its input line by line. – hpaulj Mar 16 '15 at 00:12
  • TBH my biggest problem is computing the ConvexHull; since the data is big I can't load it all into memory, so I need to find a way to compute a ConvexHull chunk by chunk and do something over all of them, maybe I should go and use pure qhull – Am1rr3zA Mar 16 '15 at 00:14
  • Maybe you need a new question, specifically about `ConvexHull`. I've worked a little with this function, but way back, and not with very large sets. – hpaulj Mar 16 '15 at 00:30
  • as you said the code is very slow, I tried pandas.read_csv and np.fromfile(gen, dtype=np.float32, sep="\t"), which are much faster, but the problem is they don't accept a generator as their first argument; they want the exact file path (see the sketch below) – Am1rr3zA Mar 16 '15 at 00:44
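Regarding the last comment: pandas.read_csv does not need a generator at all; its chunksize parameter returns an iterator of DataFrames, so a hedged sketch (the file name, chunk size, and per-chunk hull use are placeholders) might look like:

import pandas as pd
from scipy.spatial import ConvexHull

# each iteration yields a DataFrame with at most `chunksize` rows
reader = pd.read_csv("points.txt", sep=r"\s+", header=None,
                     dtype=float, chunksize=1000000)
for df in reader:
    arr = df.values            # plain (n, 2) NumPy array for this chunk
    hull = ConvexHull(arr)     # hull of this chunk only (placeholder use)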

You could try the second method from this post and read the file in chunks, referring to any given line via a pre-computed line-offset array, provided that array fits into memory. Here is an example of what I typically use to avoid loading whole files into memory:

data_file = open("data_file.txt", "rb")

# build an index: byte offset of the start of every line
line_offset = []
offset = 0

while True:
    # readlines(hint) reads roughly 100 kB worth of complete lines at a time
    lines = data_file.readlines(100000)
    if not lines:
        break

    for line in lines:
        line_offset.append(offset)
        offset += len(line)

# reading an arbitrary line by its number
line_to_read = 1
data_file.seek(line_offset[line_to_read])
line = data_file.readline()
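Building on the index above, a hedged sketch of pulling an arbitrary block of N consecutive lines back out as a NumPy array (N and start_line are placeholders; the decode is needed because the file was opened in binary mode):

import numpy as np

N = 5             # lines per chunk (placeholder)
start_line = 10   # index of the first line of the chunk (placeholder)

data_file.seek(line_offset[start_line])
chunk = [data_file.readline().decode() for _ in range(N)]
arr = np.array([[float(v) for v in ln.split()] for ln in chunk if ln.strip()])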
Mr. Girgitt
  • In your case, in the while loop you should additionally count lines and use the total line count to prepare the chunks to be read. – Mr. Girgitt Mar 15 '15 at 23:55
  • I couldn't understand your code very well, and when I tried to run it, it took a lot of time, and it didn't solve my problem of converting the output into a 2D NumPy array (in an efficient way) – Am1rr3zA Mar 16 '15 at 00:07
  • I just showed how to read the input file in chunks, not how to process it in chunks in numpy. I made a test and my conclusion is that the problem will not fit in memory anyway: a 300 MB test file requires 180 MB of RAM just for the lookup table referencing the lines, so a 30 GB file would need 18 GB just for that. A plain Python array needs 1.2 GB of RAM for the data from such a 300 MB file (10 million lines). Your problem needs decomposition. – Mr. Girgitt Mar 16 '15 at 11:59
  • My tests of building a single numpy array with numpy.vstack and numpy.concatenate show worse than O(n) time complexity. Building a single 1-billion-item array this way would take too long. – Mr. Girgitt Mar 16 '15 at 12:42
  • yes, numpy.vstack is very slow, but thanks anyway for your answer – Am1rr3zA Mar 16 '15 at 17:15
  • It looks like scipy.spatial.ConvexHull supports incremental mode (http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.spatial.ConvexHull.html). You could try a naive approach (if your task is a one-off) and use my example or the other examples presented here to read chunks of the input file and incrementally compute the hull (a sketch follows these comments). Such a single-threaded approach would be slow. – Mr. Girgitt Mar 16 '15 at 20:44
  • A better approach is to prepare chunks of data, compute the decomposed convex hulls in parallel (in sub-processes), and then gather all points from the partial hulls for a final computation. Aggregation following decomposition can be done in more than one step if the data reduction through decomposition still produces too many points to process in the final computation. – Mr. Girgitt Mar 16 '15 at 20:57
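Following up on the incremental-mode comment, a hedged sketch of combining chunked reading with ConvexHull(..., incremental=True) (the file name and chunk size are placeholders; this assumes a SciPy version whose ConvexHull supports incremental mode, as in the linked docs):

import numpy as np
from itertools import islice
from scipy.spatial import ConvexHull

N = 1000000                                  # lines per chunk (placeholder)
hull = None
with open("points.txt") as infile:           # placeholder file name
    while True:
        lines = list(islice(infile, N))
        if not lines:
            break
        pts = np.array([[float(v) for v in ln.split()] for ln in lines])
        if hull is None:
            hull = ConvexHull(pts, incremental=True)
        else:
            hull.add_points(pts)             # fold this chunk into the hull
if hull is not None:
    hull.close()                             # no more points will be added
    print(hull.vertices)                     # indices of the hull vertices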

You can define a chunk reader as follows, using a generator:

def read_file_chunk(fname, chunksize=500000):
    with open(fname, 'r') as myfile:
        lines = []
        for i, line in enumerate(myfile):
            # parse the two whitespace-separated values into floats
            lines.append([float(val) for val in line.split()])
            if (i + 1) % chunksize == 0:
                yield lines
                lines = []  # reset the lines list
        if lines:
            yield lines  # final few lines of the file

# and, assuming the function you want to apply is called `my_func`
chunk_gen = read_file_chunk(my_file_name)
for chunk in chunk_gen:
    my_func(chunk)
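If my_func expects a NumPy array rather than a list of lists (as the question implies), each chunk can be converted first; a minimal sketch:

import numpy as np

for chunk in read_file_chunk(my_file_name):
    arr = np.array(chunk, dtype=float)  # shape (len(chunk), 2)
    my_func(arr)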
Haleemur Ali
  • As I said in my question, I need to convert the output to a NumPy array. I can already read chunks of the file, but the problem is that I then need a for loop over them, reading the output line by line to build my NumPy array, which is not efficient at all. To be clearer: in your code I don't want to apply my_func(chunk) to each line, I want to apply my function to the whole chunk at once – Am1rr3zA Mar 16 '15 at 00:01
  • to clarify, the question isn't about reading a file in chunks, but rather about mapping `ConvexHull` over chunked numpy matrices and then reducing them somehow? – Haleemur Ali Mar 16 '15 at 02:31
  • not exactly, as you can see above I need something like @hpaulj's answer, but np.genfromtxt(gen, dtype=None) is very slow and I need to find a faster way to read the data. Computing the ConvexHull is another challenge that I should think about after I solve my current problem. – Am1rr3zA Mar 16 '15 at 03:37

You can look into DAGPype's chunk_stream_bytes. I haven't worked with it, but I hope it will help.

This is an example of chunked reading and processing of a .csv file (_f_name); note that np and filt here are DAGPype's own names, not NumPy's:

 np.chunk_stream_bytes(_f_name, num_cols = 2) | \
        filt(lambda a : a[logical_and(a[:, 0] < 10, a[:, 1] < 10), :]) | \
        np.corr()
Denti