
This follows on from a previous question: to improve the time performance of a function in Python, I need to find an efficient way to split my text file.

I have the following text file (more than 32 GB), not sorted:

....................
0 274 593869.99 6734999.96 121.83 1,
0 273 593869.51 6734999.92 121.57 1,
0 273 593869.15 6734999.89 121.57 1,
0 273 593868.79 6734999.86 121.65 1,
0 272 593868.44 6734999.84 121.65 1,
0 273 593869.00 6734999.94 124.21 1,
0 273 593868.68 6734999.92 124.32 1,
0 274 593868.39 6734999.90 124.44 1,
0 275 593866.94 6734999.71 121.37 1,
0 273 593868.73 6734999.99 127.28 1,
.............................

The first and second columns are the ID (e.g. 0 273) giving the location of the x,y,z point in a grid.

def point_grid_id(x, y, minx, maxy, distx, disty):
    """Give the grid ID (row, col) of the point (x, y)."""
    col = int((x - minx) / distx)
    row = int((maxy - y) / disty)
    return (row, col)

Here (minx, maxy) is the origin of my grid, with cell size distx, disty. The tile IDs are:

import numpy as np

tiles_id = [j for j in np.ndindex(ny, nx)]  # ny = number of rows, nx = number of columns
# -> [(0, 0), (0, 1), (0, 2), ..., (ny - 1, nx - 1)]
n = len(tiles_id)
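For example, with a hypothetical grid origin and cell size (the numbers here are made up purely for illustration), the first data line above maps to a tile ID like this:

# Hypothetical grid parameters, chosen only for illustration.
minx, maxy = 593800.0, 6735000.0
distx = disty = 0.5

x, y = 593869.99, 6734999.96   # x, y of the first data line above
print(point_grid_id(x, y, minx, maxy, distx, disty))   # -> (0, 139)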

I need to slice the ~32 GB file into n (= len(tiles_id)) files.

I can do this without sorting, but only by reading the file n times, as sketched below. For this reason I wish to find an efficient method to split the file in a single pass, starting from (0,0) (= tiles_id[0]). After that I can read each of the split files only once.
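For reference, a minimal sketch of the n-pass approach I want to avoid (one full scan of the 32 GB file per tile; the source file name is made up):

# Slow approach: one full pass over the source file per tile.
for row, col in tiles_id:
    with open('points.txt') as src, open('tile_%d_%d.txt' % (row, col), 'w') as dst:
        for line in src:
            id0, id1 = line.split()[:2]
            if (int(id0), int(id1)) == (row, col):
                dst.write(line)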

Gianni Spear
  • how about not using python? – Bartek Banachewicz Mar 05 '13 at 15:13
  • Not sure how efficient you can really get with Python for sorting a file of that size. – Tony The Lion Mar 05 '13 at 15:15
  • Well I hope you've got something to read while it will run, then. – Bartek Banachewicz Mar 05 '13 at 15:15
  • just free Weekend hiking in the mounts and far from the PC :). I know Python is slow, but if the algorithm takes some days it's ok – Gianni Spear Mar 05 '13 at 15:17
  • At times I do miss the Big Iron for these JOBs. Those Dinosaurs make you feel that its a cake walk. IBM SORT and the Extension ICETOOL are made for these JOBS. I have also sorted TBs of data effortlessly :-) – Abhijit Mar 05 '13 at 15:27
  • @Gianni: `I LOVE Python!! `, Yeah I can see, LOVE is so blind that you can't see its weakness – Abhijit Mar 05 '13 at 15:29
  • @Abhijit, Love is Love!!! :) – Gianni Spear Mar 05 '13 at 15:32
  • @Abhijit and Bartek Banachewicz "Life's too short to code in C++" :) – Gianni Spear Mar 05 '13 at 16:39
  • @Abhijit, personally I see no demonstration of Python weaknesses here, though I admit they really exist. The problem is a good example of a task where performance should be optimized by optimizing algorithms, not the code. And only if the result is still not satisfactory after algorithmic optimizations, there is a serious reason to optimize the code, probably even rewriting it in C++. – Ellioh Mar 06 '13 at 07:40

2 Answers


Sorting is hardly possible for a 32 GB file, no matter whether you use Python or a command-line tool (sort). Databases seem too powerful for this task, but may be used. However, if you are unwilling to use databases, I would suggest simply splitting the source file into files using the tile ID.

You read a line, make a file name out of its tile ID, and append the line to that file; continue until the source file is exhausted. It is not going to be very fast, but at least it has O(N) complexity, unlike sorting.

And, of course, sorting the individual files afterwards and concatenating them is possible. The main bottleneck in sorting a 32 GB file would be memory, not CPU.
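A minimal sketch of that follow-up step, assuming each per-tile file produced by the splitting code below is small enough to sort in memory (the output file name is made up):

import glob

# Sort each tile file in memory, then concatenate the results.
with open('sorted_all.txt', 'w') as out:
    for fn in sorted(glob.glob('tile_*_*.tmp')):
        with open(fn) as f:
            out.writelines(sorted(f))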

Here is the splitting code, I think:

def temp_file_name(l):
    # Build a per-tile file name from the first two columns (the tile ID).
    id0, id1 = l.split()[:2]
    return "tile_%s_%s.tmp" % (id0, id1)

def split_file(name):
    # One pass over the source file: append each line to its tile's file.
    ofiles = {}
    try:
        with open(name) as f:
            for l in f:
                if l:
                    fn = temp_file_name(l)
                    if fn not in ofiles:
                        ofiles[fn] = open(fn, 'w')
                    ofiles[fn].write(l)
    finally:
        for of in ofiles.values():
            of.close()

split_file('srcdata1.txt')

But if there are a lot of tiles, more than the number of files you can open at once, you may do this:

def split_file(name):
    # Reopen the tile file in append mode for every line: slower, but it
    # never holds more than one output file open at a time.
    with open(name) as f:
        for l in f:
            if l:
                fn = temp_file_name(l)
                with open(fn, 'a') as of:
                    of.write(l)

And the most thorough way is to close some files and remove them from the dictionary after reaching a limit on the number of open files.
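A minimal sketch of that idea, assuming Python 3's OrderedDict.move_to_end for LRU eviction and a made-up limit of 100 handles:

from collections import OrderedDict

MAX_OPEN = 100  # assumed limit; tune to your OS's open-file limit

def split_file_lru(name):
    ofiles = OrderedDict()  # tile file name -> handle, least recently used first
    try:
        with open(name) as f:
            for l in f:
                if l:
                    fn = temp_file_name(l)
                    if fn in ofiles:
                        ofiles.move_to_end(fn)  # mark as most recently used
                    else:
                        if len(ofiles) >= MAX_OPEN:
                            _, oldest = ofiles.popitem(last=False)  # evict LRU handle
                            oldest.close()
                        ofiles[fn] = open(fn, 'a')  # append: the file may already exist
                    ofiles[fn].write(l)
    finally:
        for of in ofiles.values():
            of.close()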

Ellioh
  • And then you can sort the individual files easily. – Tim Pietzcker Mar 05 '13 at 15:19
  • but do i need read only one time? – Gianni Spear Mar 05 '13 at 15:19
  • Sure it is possible. You'd have to use a multi-file sort and merge, but it's possible. – Martijn Pieters Mar 05 '13 at 15:19
  • Do you really need SORTING? You rather need splitting into tiles, unless I'm wrong. Splitting is easy. But if full sorting is desired, you will be able to sort each file and concatenate them. – Ellioh Mar 05 '13 at 15:21
  • Dear Ellioh and Martijn, Thanks for help your guys are always great. Can be interesting try all solution. Sorting is not required if i can split reading-splitting the file only one time. The sorting idea was an approach to avoid to read the file n=len(tiles_id) times – Gianni Spear Mar 05 '13 at 15:23
  • @Ellioh, could i ask a code example please. I don't want abuse of your time. (it helps me to study how other people code). thanks – Gianni Spear Mar 05 '13 at 15:30
  • image the first line is ['451', '35', '592678.75', '6732741.20', '106.20', '1'] and the output file is "tempfile_{}_{}".format(element[0], element[1]) 'tempfile_451_35' – Gianni Spear Mar 05 '13 at 15:31
  • The filename seems ok. Why not? – Ellioh Mar 05 '13 at 15:42
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/25609/discussion-between-ellioh-and-gianni) – Ellioh Mar 05 '13 at 15:42
  • @Ellioh i think there is an open connection to close – Gianni Spear Mar 05 '13 at 22:42
  • @Gianni, do you mean an open file? There should not be. with always closes files. What makes you suspecting that? – Ellioh Mar 06 '13 at 06:50
  • @Ellioh. No just my mistake. Please, could you explain why "if l:"?. Can we skip that line?. Thanks! – Gianni Spear Mar 06 '13 at 11:57
  • My sample file included an empty line. :-) If yours cannot, you may omit "if l:" safely. – Ellioh Mar 06 '13 at 12:55

A quick Google search led me to this recipe on ActiveState Code. It didn't give any performance comparison, but it seems to do the job.

In short, it seems to do what @Ellioh suggested, and you have a ready-made recipe, so you may not have to reinvent the wheel.
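I have not inspected the recipe line by line, but external sorting of a file that does not fit in memory usually works roughly like this (a sketch of the general technique, not the recipe's actual code; the chunk size and file names are made up):

import heapq
from itertools import islice

def external_sort(src, dst, chunk_lines=1000000):
    # Phase 1: sort fixed-size chunks of the source file and write them out.
    chunks = []
    with open(src) as f:
        while True:
            lines = sorted(islice(f, chunk_lines))
            if not lines:
                break
            chunk_name = '%s.chunk%d' % (src, len(chunks))
            with open(chunk_name, 'w') as out:
                out.writelines(lines)
            chunks.append(chunk_name)
    # Phase 2: merge the sorted chunks with a heap.
    files = [open(c) for c in chunks]
    try:
        with open(dst, 'w') as out:
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()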

Abhijit