3

I have a text file of several GB with this format:

0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 273 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 273 593868.39 6734999.90 124.44 1,
0 273 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,

I have a simple filter function in Python 2.7 on Windows. The function reads the entire file, selects the lines with a given idtile (the first and second columns) and returns the list of points (x, y, z, and label) and the idtile.

tiles_id = [j for j in np.ndindex(ny, nx)] # ny = number of rows, nx = number of columns
idtile = tiles_id[0]

def file_filter(name, idtile):
    lst = []
    for line in file(name, mode="r"):
        element = line.split() # add value
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])
            dy, dx = int(element[0]), int(element[1])
    return(lst, dy, dx)

The file is more than 32 GB and the bottleneck is reading the file. I am looking for suggestions or examples to speed up my function (e.g. parallel computing or other approaches).

My solution is to split the text file into tiles (using x and y location). The solution is not elegant and I am looking for an efficient approach.

Gianni Spear
  • Can you split the file? If you can you may be able to achieve something with a parallel approach, otherwise it's unlikely that will be of help. Also, even then, you'd have to be on an SSD for it to help as hard drives are fastest at linear reads not seeking around – entropy Mar 05 '13 at 12:27
  • @entropy I did. I split the file into tiles, but my solution is not elegant. I am looking for a more elegant solution. – Gianni Spear Mar 05 '13 at 12:29
  • 2
    How many times is `file_filter()` being called in your program? – Ber Mar 05 '13 at 12:30
  • 5
    If you want to read this file many times, how about giving the job to a database? – eumiro Mar 05 '13 at 12:30
  • @Ber suppose lts_idtile is a list of idtiles; file_filter() is called once for each of them, i.e. len(lts_idtile) times. – Gianni Spear Mar 05 '13 at 12:32
  • @eumiro, I am not familiar with a database approach. – Gianni Spear Mar 05 '13 at 12:34
  • @Gianni: the rough idea would be to use *SQL with idx, idy, x, y, z and whatever the '1' is named. The database stores the data efficiently for repeated retrievals of values, e.g. something like `SELECT * FROM vals WHERE idx = 0 AND idy = 273`. Add the 'power' of an 'index' to that and you have grokked yourself another solution. – akira Mar 05 '13 at 12:38
  • How many unique idtiles are there? I suggest you split the text file into many small text files where the filename is the idtile and the content is all the rows that start with that idtile. If the count of idtiles is large, you can use `hash(idtile) % N` as the file name. – HYRY Mar 05 '13 at 12:40
  • Dear @akira, the file is really huge (more than 32 GB). Is it possible to store the data? – Gianni Spear Mar 05 '13 at 12:41
  • Dear @Gianni: 32 GB of whatever fits onto the little stick that sits within the camera I use to take useless pictures of sunflowers. – akira Mar 05 '13 at 12:42
  • Dear @HYRY, about splitting the text file into many small text files using the idtile: the only approach I can think of is reading the text file len(lts_idtile) times and saving it len(lts_idtile) times. Do you have a good approach to split the text file efficiently? – Gianni Spear Mar 05 '13 at 12:43
  • Dear @akira, I wish to see your sunflowers (I come from Tuscany). Could I ask for an example of your approach, please? – Gianni Spear Mar 05 '13 at 12:45
  • Dear @BurhanKhalid. No, this is the problem. – Gianni Spear Mar 05 '13 at 12:50
  • @BurhanKhalid, probably after about 2 million lines, 0 274 returns. – Gianni Spear Mar 05 '13 at 12:51
  • I am a little bit naive on Python, but as a start: mid_pos = filesize/2. Run one of your operations from byte 0 of the file to the mid and the second operation from mid+1 to the end. You could decide on the number of chunks of bytes you wish to operate on per operation. But I am not sure whether locking is required. – sameer karjatkar Mar 05 '13 at 12:53
  • For every idtile in lts_idtile you need to open the file with the name of the idtile and read its content. So I suggest you split the large text file into many small binary files, and the content of a binary file can be loaded by `numpy.fromfile()`. – HYRY Mar 05 '13 at 13:01
  • @sameerkarjatkar If the process is I/O bound that wouldn't help as you'd still read only as fast as the drive could serve you the file – entropy Mar 05 '13 at 13:08
  • I am going to measure times for several of the solutions... Will modify my answer when done. – Ellioh Mar 05 '13 at 14:32
  • Reading file is not a bottleneck! Measured times, file input takes some 10% out of the whole execution time. – Ellioh Mar 05 '13 at 14:49

10 Answers

1

I suggest you change your code so that you read the big file once and write (temporary) files for each tile id. Something like:

def create_files(name, idtiles=None):
    files = {}
    for line in open(name):
        elements = line.split()
        idtile = (int(elements[0]), int(elements[1]))
        if idtiles is not None and idtile not in idtiles:
            continue
        if idtile not in files:
            files[idtile] = open("tempfile_{}_{}".format(elements[0], elements[1]), "w")
        files[idtile].write(line)  # line already ends with a newline
    for f in files.itervalues():
        f.close()
    return files

create_files() will return a {(tilex, tiley): fileobject} dictionary (the file objects are already closed by the time it returns).
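
A minimal usage sketch of reading one of those temp files back, assuming they were written by create_files() above (read_tile is a hypothetical helper name):

def read_tile(tilex, tiley):
    # Read a per-tile temp file back into a list of [x, y, z, label] strings.
    lst = []
    with open("tempfile_{}_{}".format(tilex, tiley)) as f:
        for line in f:
            lst.append(line.split()[2:])
    return lst

points = read_tile(0, 273)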

A variant that closes the file after writing each line, to work around the "Too many open files" error. This variant returns a {(tilex, tiley): filename} dictionary, and will probably be a bit slower.

def create_files(name, idtiles=None):
    files = {}
    for line in open(name):
        elements = line.split()
        idtile = (int(elements[0]), int(elements[1]))
        if idtiles is not None and idtile not in idtiles:
            continue
        filename = "tempfile_{}_{}".format(elements[0], elements[1])
        files[idtile] = filename
        with open(filename, "a") as f:
            f.write(line)  # line already ends with a newline
    return files
codeape
  • So, you split the large file into smaller files... something the OP did already, if I am not mistaken. – akira Mar 05 '13 at 12:40
  • @codeape. It returns this error message: create_files(name) Traceback (most recent call last): File "", line 1, in File "", line 9, in create_files IOError: [Errno 24] Too many open files: 'tempfile_3_340' – Gianni Spear Mar 05 '13 at 13:10
  • 1
    Updated with variant that closes the files after each written line. – codeape Mar 05 '13 at 13:21
  • @codeape Does the file need to be sorted? Your code needs to be fixed a little bit. – Gianni Spear Mar 05 '13 at 14:47
  • I am not sure if I understand the last question. I haven't tested the code, so yes, it might need some fixing. Nothing needs to be sorted I believe. – codeape Mar 05 '13 at 15:10
1

Your 'idtile's appear to be in a certain order. That is, the sample data suggests that once you traverse through a certain 'idtile' and hit the next, there is no chance that a line with that 'idtile' will show up again. If this is the case, you may break the for loop once you finish dealing with the 'idtile' you want and hit a different one. Off the top of my head:

loopkiller = False
for line in file(name, mode="r"):
    element = line.split()
    if (int(element[0]), int(element[1])) == idtile:
        lst.append(element[2:])
        dy, dx = int(element[0]), int(element[1])
        loopkiller = True
    elif loopkiller:
        break

This way, once you are done with a certain 'idtile', you stop; whereas in your example, you keep on reading until the end of the file.

If your idtiles appear in a random order, maybe you could try writing an ordered version of your file first.
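
For instance, a hedged sketch of producing that ordered version with an external sort (this assumes a GNU sort binary is on the PATH, e.g. from Cygwin or GnuWin32 on Windows; the output name and temp directory are placeholders):

import subprocess

# Sort numerically on the first two columns (the idtile); -S caps sort's memory
# use and -T points it at a directory with enough room for its temporary files.
subprocess.check_call([
    "sort", "-n", "-k1,1", "-k2,2",
    "-S", "1G", "-T", r"C:\temp",
    "-o", "huge_sorted.txt", "huge.txt",
])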

Also, evaluating the digits of your idtiles separately may help you traverse the file faster. Supposing your idtile is a two-tuple of one-digit and three-digit integers, perhaps something along the lines of:

for line in file(name, mode="r"):
    element = line.split()
    if int(element[0][0]) == idtile[0]:
        if element[1][0] == str(idtile[1])[0]:
            if element[1][1] == str(idtile[1])[1]:
                if element[1][2] == str(idtile[1])[2]:
                    dy, dx = int(element[0]), int(element[1])
                else: go_forward(walk)
            else: go_forward(run)
        else: go_forward(sprint)
    else: go_forward(warp)
mbaytas
1

Major rule of speed: Do less.

  • You could create a sorted version of huge.txt and keep track of the positions of the idtiles. So, if you search for (223872, 23239) you already know at which position in the file the given information is and can skip everything before it (file.seek). One might argue that this information is somewhat equal to an 'INDEX' (a rough sketch follows after this list).
  • You could use mmap to mmap the file into the ram.
  • You can start writing 'workers' to work on different files / positions
  • You could transfer the given file into something like an SQL server and use standard SQL to retrieve the data. Yes, transferring the 32 GB of data into the database takes time, but consider it a kind of preprocessing. After that the database will use whatever means it has to access the data faster than your approach.
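
A rough sketch of the index idea from the first bullet, assuming huge.txt has already been sorted by idtile (build_index and read_sorted_tile are hypothetical helper names):

def build_index(name):
    # One pass over the sorted file: remember the byte offset where each idtile starts.
    index = {}
    offset = 0
    with open(name, "rb") as f:
        for line in f:
            element = line.split()
            idtile = (int(element[0]), int(element[1]))
            if idtile not in index:
                index[idtile] = offset
            offset += len(line)
    return index

def read_sorted_tile(name, index, idtile):
    # Seek straight to the tile's first line and read until the idtile changes.
    lst = []
    with open(name, "rb") as f:
        f.seek(index[idtile])
        for line in f:
            element = line.split()
            if (int(element[0]), int(element[1])) != idtile:
                break
            lst.append(element[2:])
    return lst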

Minor thoughts:

  • You could work with slices instead of line.split() to avoid lots of tiny memory allocations.

Outline of how to use a Database:

Assume you have something like PostgreSQL around:

CREATE TABLE tiles
(
  tile_x integer,
  tile_y integer,
  x double precision,
  y double precision,
  z double precision,
  flag integer
);

Then you could replace all whitespace in the input file with | and drop the trailing , (to create a nice and shiny .csv) and feed that directly into the database:

 COPY "tiles" FROM '\full\path\to\huge.txt' WITH DELIMITER '|';

You can then do fancy stuff like this:

SELECT * FROM "tiles";

tile_x | tile_y |     x     |     y      |   z    | flag
-------+--------+-----------+------------+--------+-----
     0 |    274 | 593869.99 | 6734999.96 | 121.83 |    1
     0 |    273 | 593869.51 | 6734999.92 | 121.57 |    1
     0 |    273 | 593869.15 | 6734999.89 | 121.57 |    1
     0 |    273 | 593868.79 | 6734999.86 | 121.65 |    1
     0 |    273 | 593868.44 | 6734999.84 | 121.65 |    1
     0 |    273 |    593869 | 6734999.94 | 124.21 |    1
     0 |    273 | 593868.68 | 6734999.92 | 124.32 |    1
     0 |    273 | 593868.39 |  6734999.9 | 124.44 |    1
     0 |    273 | 593866.94 | 6734999.71 | 121.37 |    1
     0 |    273 | 593868.73 | 6734999.99 | 127.28 |    1 

Or something like this:

SELECT * FROM "tiles" WHERE tile_y > 273;

tile_x | tile_y |     x     |     y      |   z    | flag
-------+--------+-----------+------------+--------+-----
     0 |    274 | 593869.99 | 6734999.96 | 121.83 |    1
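
From Python, a minimal sketch of querying that table, assuming the psycopg2 driver is installed; the connection parameters and index name below are placeholders:

import psycopg2

# Placeholder connection parameters.
conn = psycopg2.connect(dbname="lidar", user="postgres", password="secret", host="localhost")
cur = conn.cursor()

# An index on (tile_x, tile_y) makes the per-tile lookups fast (run once).
cur.execute("CREATE INDEX tiles_idx ON tiles (tile_x, tile_y)")
conn.commit()

cur.execute("SELECT x, y, z, flag FROM tiles WHERE tile_x = %s AND tile_y = %s", (0, 273))
points = cur.fetchall()  # list of (x, y, z, flag) tuples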
akira
1

You can convert your filter to a generator function:

def file_filter(name):
    lst = []
    idtile = None
    dx, dy = None, None
    for line in file(name, mode="r"):
        element = line.split() # add value
        cur = (int(element[0]), int(element[1]))
        if idtile is None:
            idtile = cur
        if cur == idtile:
            lst.append(element[2:])
            dy, dx = cur
        else:
            yield lst, dx, dy
            # start the next tile; keep the line that triggered the change
            idtile = cur
            dy, dx = cur
            lst = [element[2:]]
    if lst:
        yield lst, dx, dy  # last tile in the file

This function yields one tuple of (list_of_data, dx, dy) for each idtile, provided that the file is sorted on idtile.

Now you can use it like this:

for lst, dx, dy in file_filter('you_name_it'):
    process_tile_data(lst, dx, dy)
Ber
1

I would suggest comparing the time used by your full reading procedure with the time needed to just read the lines and do nothing with them. If those times are close, the only thing you can really do is change the approach (splitting your files etc.), because what you can probably optimize is the data processing time, not the file reading time.

I also see two points in your code that are worth fixing:

  1. Use with to explicitly close your file. Your solution should work in CPython, but that relies on the implementation and may not always be as effective:

     with open(name) as f:
         for line in f:
             pass # Here goes the loop body

  2. You perform the transformation of a string to int twice. It is a relatively slow operation. Remove the second one by reusing the result.

P.S. It looks like an array of depth or height values for a set of points on Earth surface, and the surface is split in tiles. :-)

Ellioh
  • Thanks Ellioh. Suggestions to improve are always welcome. Why `with open(name) as f:` instead of `for line in open(name):`? – Gianni Spear Mar 05 '13 at 13:06
  • 1
    Fixed a typo. :-) "for line in f", of course. The with statement always closes the file. In your code the file is closed when the garbage collector removes the file object. In CPython with its reference counting that is almost the same, but other flavors of Python may do it another way, and the file will stay open until the object is garbage collected, maybe much later. – Ellioh Mar 05 '13 at 13:09
1

My solution is to split the large text file into many small binary files, one for each idtile. To read the text file faster, you can use pandas:

import pandas as pd
import numpy as np

n = 400000 # read n rows as one block
for df in pd.read_table(large_text_file, sep=" ", comment=",", header=None, chunksize=n):
    for key, g in df.groupby([0, 1]):
        fn = "%d_%d.tmp" % key
        with open(fn, "ab") as f:
            data = g.ix[:, 2:5].values
            data.tofile(f)

Then you can get content of one binary file by:

np.fromfile("0_273.tmp").reshape(-1, 4)
HYRY
1

You can avoid doing the split() and int() on every line by doing a string comparison instead:

def file_filter(name,idtile):
    lst = []
    id_str = "%d %d " % idtile
    with open(name) as f:
        for line in f:
            if line.startswith(id_str):
                element = line.split() # add value
                lst.append(element[2:])
                dy, dx = int(element[0]),int(element[1])
    return(lst, dy, dx)
Janne Karila
1

Here are some statistics. I'm going to update them as more solutions appear. The following program works with a file that consists of repetitions of the lines from the question.

import sys

def f0(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            pass


"""def f0(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            line.split()"""


def f1(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            element = line.split() # add value
            if (int(element[0]),int(element[1])) == idtile:
                lst.append(element[2:])
                dy, dx = int(element[0]),int(element[1])
    return(lst, dy, dx)


def f2(name, idtile):
    lst = []
    dx, dy = None, None
    with open(name) as f:
        for line in f:
            element = line.split() # add value
            pdx, pdy = int(element[0]),int(element[1])
            if (pdx, pdy) == idtile:
                lst.append(element[2:])
                dy, dx = pdx, pdy
    return(lst, dy, dx)


def f3(name,idtile):
    lst = []
    id_str = "%d %d " % idtile
    with open(name) as f:
        for line in f:
            if line.startswith(id_str):
                element = line.split() # add value
                lst.append(element[2:])
                dy, dx = int(element[0]),int(element[1])
    return(lst, dy, dx)

functions = [f0, f1, f2, f3]

functions[int(sys.argv[1])]("./srcdata.txt",(0, 273))

The shell script for timing is straightforward:

#!/bin/sh
time python ./timing.py 0
time python ./timing.py 1
time python ./timing.py 2
time python ./timing.py 3

I prefer to run it that way so that previously run functions do not influence the timing of the others.

And the result is:

0.02user 0.01system 0:00.03elapsed
0.42user 0.01system 0:00.44elapsed
0.32user 0.02system 0:00.34elapsed
0.33user 0.01system 0:00.35elapsed

The good news: reading the file is NOT the bottleneck, and removing the extra conversion to int is effective.

The bad news: I still do not know how to optimize it significantly.

Ellioh
  • Reading the file is not a bottleneck; the problem is that you read the file more than a million times. – Gianni Spear Mar 05 '13 at 14:54
  • Probably I need to sort the file first and slice it idtile by idtile. – Gianni Spear Mar 05 '13 at 14:56
  • Yes, then it should be sliced. In fact, it's the only way, I think. – Ellioh Mar 05 '13 at 15:01
  • I have opened another question on SO. I will put the link here if you would like to take part. – Gianni Spear Mar 05 '13 at 15:03
  • Yes, at least I would like to have a look. – Ellioh Mar 05 '13 at 15:03
  • posted http://stackoverflow.com/questions/15227254/efficent-way-to-sorting-a-large-text-file-in-python – Gianni Spear Mar 05 '13 at 15:11
  • In the timings you are looking for `(0, 273)` which is 90 % of the file. I believe @Gianni is trying to extract a very small fraction of the file. Could you re-run the timings using an id that is _not_ in the file? That would be more representative IMO. – Janne Karila Mar 06 '13 at 06:56
  • @Janne Karila, that's true; the real problem is reading the file once per tile id. However, the problem seems to be solved by splitting the file by tile id into many files. Follow the link in Gianni's previous comment. – Ellioh Mar 06 '13 at 07:36
1

Maybe the best and fastest way to solve your problem is to use a map-reduce algorithm on a (massively) parallel system.
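
A minimal sketch of the map step with the standard multiprocessing module, assuming the storage is fast enough that several readers help; the file name, tile, worker count and chunking are illustrative:

import os
from multiprocessing import Pool

NAME = "huge.txt"      # placeholder input file
IDTILE = (0, 273)      # tile to extract
WORKERS = 4

def filter_chunk(bounds):
    # Map step: scan one byte range of the file and keep the points of IDTILE.
    start, end = bounds
    lst = []
    with open(NAME, "rb") as f:
        if start:
            f.seek(start - 1)
            f.readline()              # move to the next full line boundary
        while f.tell() < end:
            element = f.readline().split()
            if len(element) > 1 and (int(element[0]), int(element[1])) == IDTILE:
                lst.append(element[2:])
    return lst

if __name__ == "__main__":
    size = os.path.getsize(NAME)
    step = size // WORKERS + 1
    chunks = [(i * step, min((i + 1) * step, size)) for i in range(WORKERS)]
    pool = Pool(WORKERS)
    # Reduce step: concatenate the per-chunk results.
    points = sum(pool.map(filter_chunk, chunks), [])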

Ber
0

Okay. If you need to do this many times, you obviously need to create some sort of an index. But if this is not a frequent activity, the best bet would be to multithread it like this.

import Queue
import threading

NUMWORKERS = 8
workQ = Queue.Queue()

def file_filter(name, idtile):
    lst = []
    workerlist = []
    for i in range(NUMWORKERS):
        worker = threading.Thread(target=workerThread, args=(lst, idtile))
        worker.start()
        workerlist.append(worker)
    for line in file(name, mode="r"):
        workQ.put(line)
    for i in range(NUMWORKERS):
        workQ.put(None)          # one stop sentinel per worker
    for worker in workerlist:
        worker.join()
    return lst, idtile[0], idtile[1]

def workerThread(lst, idtile):
    while True:
        line = workQ.get()
        if line is None:         # sentinel: no more work
            return
        element = line.split()   # add value
        if (int(element[0]), int(element[1])) == idtile:
            lst.append(element[2:])

In case this is a very frequent activity that happens for each idtile, the solution would be drastically different. Doing this for a number of idtiles together will give you the best amortized performance, because any number of already-known idtiles can be processed in a single loop over the file (see the sketch below).
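
A minimal sketch of that single-pass idea (file_filter_many is a hypothetical name and huge.txt a placeholder file name; it assumes the wanted idtiles are known up front):

def file_filter_many(name, idtiles):
    # Collect the points of every wanted idtile in a single pass over the file.
    wanted = set(idtiles)
    result = dict((t, []) for t in wanted)   # idtile -> list of [x, y, z, label]
    with open(name) as f:
        for line in f:
            element = line.split()
            idtile = (int(element[0]), int(element[1]))
            if idtile in wanted:
                result[idtile].append(element[2:])
    return result

points_by_tile = file_filter_many("huge.txt", [(0, 273), (0, 274)])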

saketrp
  • This is advantageous because your IO thread is not wasting its time doing computation. Which would mean your IO goes at the maximum possible speed. – saketrp Mar 05 '13 at 13:23
  • I don't believe this will help. Even more, I would bet this is much slower than doing it the ordinary way, at least in CPython. – Ellioh Mar 05 '13 at 14:25