1

I am working with a text file of about 12*10^6 rows which is stored on my hard disk. The structure of the file is:

data|data|data|...|data\n
data|data|data|...|data\n
data|data|data|...|data\n
...
data|data|data|...|data\n

There's no header, and there's no id to uniquely identify the rows.

Since I want to use it for machine learning purposes, I need to make sure that there's no order in the text file which may affect the stochastic learning.

Usually I load this kind of file into memory, shuffle it there, and rewrite it to disk. Unfortunately this time that is not possible, due to the size of the file, so I have to manage the shuffling directly on disk (assume I have no problem with disk space). Any idea about how to effectively (with the lowest possible complexity, i.e. the fewest writes to disk) manage such a task with Python?

Luca Massaron
  • 1,734
  • 18
  • 25
  • afaik you can't .... you could seek to a random point in the file and get the next line probably (allowing you to grab random lines ... but I don't think that method would work very well for writing back out to the file) – Joran Beasley Oct 10 '13 at 19:14
  • Are all lines the same length? – abarnert Oct 10 '13 at 19:29
  • @abarnert: No, the lines are about the same length, but not exactly the same length (due to the rounding of floating-point numbers and because of ids of different lengths) – Luca Massaron Oct 10 '13 at 19:37
  • Does this help: [How can I shuffle the lines of a text file in Unix command line?](http://stackoverflow.com/questions/2153882/how-can-i-shuffle-the-lines-of-a-text-file-in-unix-command-line) - it has some python examples though not highly voted. – Aditya Oct 10 '13 at 19:38
  • I was actually thinking of splitting the file into a large number of chunks, each chunk of a different random row length but still manageable in memory, then shuffling each one and recomposing the file by randomly arranging the chunks. Then I would repeat the procedure a few times. The more times I repeat the procedure, the better the shuffle, though I wonder if there's a smarter way in fewer passes on the disk. – Luca Massaron Oct 10 '13 at 19:42
  • @LucaM: How many rows do you have? It may be impractical to hold 100M 80-character strings in memory, but not 100M 4-byte integers… – abarnert Oct 10 '13 at 19:50
  • @abarnert: I have about 12 million rows, and a single row is on average about 1,255 bytes (yes, the file is a bit over 15 gigabytes). I found the solutions you proposed quite impressive; most likely I expect the first one (using a tmp db) to perform better. Let me try them and be back to report to you. Thank you very much. – Luca Massaron Oct 10 '13 at 20:13
  • @LucaM: OK, that's exactly what I was hoping. 15GB is a problem; 48MB isn't. If it were 1000M rows averaging 4 bytes long, my solutions would have been mostly useless… – abarnert Oct 10 '13 at 20:30
  • Could you `zlib.compress(line, 9)` a few lines and let us know what kind of compression ratio you're seeing? Text compresses extremely well, and you might be lucky and get a 90% compression ratio, which means your in-memory compressed lines might comfortably fit into RAM. Then you can `zlines.shuffle()` and write them back out to a new file with `zlib.decompress()`. – Caleb Hattingh Oct 11 '13 at 04:00

4 Answers

6

All but one of these ideas use O(N) memory—but if you use an array.array or numpy.ndarray we're talking around N*4 bytes, which is significantly smaller than the whole file. (I'll use a plain list for simplicity; if you need help converting to a more compact type, I can show that too.)
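
For instance, a minimal sketch of that more compact index, assuming linecount has already been computed as in the snippets below:

import array
import random

# Drop-in replacement for the plain list used below: about linecount * 4 bytes
# (roughly 48 MB for 12 million lines) instead of a list of Python int objects.
shuffled = array.array('i', xrange(linecount))   # use range(...) on Python 3
random.shuffle(shuffled)                         # works on any mutable sequence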


Using a temporary database and an index list:

import contextlib, dbm, os, random

with contextlib.closing(dbm.open('temp.db', 'n')) as db:
    with open(path) as f:
        for i, line in enumerate(f):
            db[str(i)] = line
    linecount = i + 1
    shuffled = list(range(linecount))
    random.shuffle(shuffled)          # shuffles in place; it returns None
    with open(path + '.shuffled', 'w') as f:
        for i in shuffled:
            f.write(db[str(i)])
os.remove('temp.db')

This is 2N single-line disk operations plus 2N single-dbm-key operations; each dbm-key access costs on the order of log N low-level disk operations, so the whole thing is roughly 2N log N single-disk-operation equivalents and the total complexity is O(N log N).


If you use a relational database like sqlite3 instead of a dbm, you don't even need the index list, because you can just do this:

SELECT * FROM Lines ORDER BY RANDOM()

This has the same time complexity as the above, and the space complexity is O(1) instead of O(N)—in theory. In practice, you need an RDBMS that can feed you a row at a time from a 100M row set without storing that 100M on either side.
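
For illustration, a rough sketch of that sqlite3 variant; the temp_shuffle.db filename and the one-column Lines table are just placeholder choices, and path is the input file as in the other snippets:

import os
import sqlite3

conn = sqlite3.connect('temp_shuffle.db')
conn.execute('CREATE TABLE Lines (line TEXT)')
with open(path) as f:
    # Stream the rows in without building a big list in memory
    conn.executemany('INSERT INTO Lines VALUES (?)', ((line,) for line in f))
conn.commit()

with open(path + '.shuffled', 'w') as out:
    for (line,) in conn.execute('SELECT line FROM Lines ORDER BY RANDOM()'):
        out.write(line)

conn.close()
os.remove('temp_shuffle.db')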


A different option, without using a temporary database—in theory O(N**2), but in practice maybe faster if you happen to have enough memory for the line cache to be helpful:

import linecache, random

with open(path) as f:
    linecount = sum(1 for _ in f)
shuffled = list(range(1, linecount + 1))  # linecache line numbers start at 1
random.shuffle(shuffled)
with open(path + '.shuffled', 'w') as f:
    for i in shuffled:
        f.write(linecache.getline(path, i))

Finally, by doubling the size of the index list, we can eliminate the temporary disk storage. But in practice, this might be a lot slower, because you're doing a lot more random-access reads, which drives aren't nearly as good at.

import random

# Binary mode so seek()/read() work in exact byte offsets
with open(path, 'rb') as f:
    # f.tell() can't be trusted while iterating, so accumulate offsets by hand
    lineranges = []
    start = 0
    for line in f:
        lineranges.append((start, start + len(line)))
        start += len(line)
    random.shuffle(lineranges)          # shuffle the (start, stop) pairs in place
    with open(path + '.shuffled', 'wb') as f1:
        for start, stop in lineranges:
            f.seek(start)
            f1.write(f.read(stop - start))
abarnert
  • 354,177
  • 51
  • 601
  • 671
  • 1
    That linecache option is neat. I'm going to use that in the future. Also, my py 2.7.5 docs say that `random.shuffle` is in-place. – Caleb Hattingh Oct 11 '13 at 04:37
  • @cjrh: Right, sorry about shuffle; you need to assign the range list to a variable and then shuffle that. I'll edit it when I'm not on a phone. Good catch. – abarnert Oct 11 '13 at 21:32
  • Notice that the line numbers in the linecache module start with 1 – o17t H1H' S'k Jun 24 '15 at 09:37
2

This is a suggestion based on my comment above. It relies on the compressed lines still fitting into memory; if that is not the case, the other solutions will be required.

import zlib
from random import shuffle

def heavy_shuffle(filename_in, filename_out):
    with open(filename_in, 'r') as f:
        zlines = [zlib.compress(line, 9) for line in f]
    shuffle(zlines)
    with open(filename_out, 'w') as f:
        for zline in zlines:
            f.write(zlib.decompress(zline))  # each line already ends with '\n'

My experience has been that zlib is fast, while bz2 offers better compression, so you may want to compare.
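
For instance, a quick, hypothetical comparison on a single row; sample_line is only a stand-in for a real line from your file:

import bz2
import timeit
import zlib

sample_line = '0.123|4.567|8.901|' * 40   # stand-in for one real data row

results = {}
for name, mod in [('zlib', zlib), ('bz2', bz2)]:
    results[name] = (len(mod.compress(sample_line, 9)),            # compressed size
                     timeit.timeit(lambda: mod.compress(sample_line, 9),
                                   number=10000))                  # seconds per 10k rows

results then maps each codec to a (compressed size, time) pair, so you can see the speed/size trade-off on your own data before committing to one.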

Also, if you can get away with chunking, say, n lines together, doing so is likely to lift your compression ratio.
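
If a block-level shuffle is acceptable for your learning algorithm (lines from the same input chunk end up near each other in the output, rather than being uniformly permuted), a sketch of such a chunked variant could look like the following, with an arbitrary n=1000 lines per chunk:

import zlib
from random import shuffle

def chunked_shuffle(filename_in, filename_out, n=1000):
    # Compressing n lines at a time usually beats per-line compression
    chunks, buf = [], []
    with open(filename_in) as f:
        for line in f:
            buf.append(line)
            if len(buf) == n:
                chunks.append(zlib.compress(''.join(buf), 9))
                buf = []
    if buf:
        chunks.append(zlib.compress(''.join(buf), 9))
    shuffle(chunks)                    # shuffle the chunk order...
    with open(filename_out, 'w') as f:
        for zchunk in chunks:
            lines = zlib.decompress(zchunk).splitlines(True)
            shuffle(lines)             # ...and the lines within each chunk
            f.writelines(lines)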


I was wondering about the likelihood of useful compression, so here's an IPython experiment. I don't know what your data looks like, so I just went with floats (as strings) rounded to 3 places and strung together with pipes:

Best-case scenario (e.g. many rows have all the same digits):

In [38]: data = '0.000|'*200

In [39]: len(data)
Out[39]: 1200

In [40]: zdata = zlib.compress(data, 9)

In [41]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
zlib compression ratio:  0.98

In [42]: bz2data = bz2.compress(data, 9)

In [43]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
bz2 compression ratio:  0.959166666667

As expected, best-case is really good, >95% compression ratio.

Worst-case scenario (randomized data):

In [44]: randdata = '|'.join(['{:.3f}'.format(x) for x in np.random.randn(200)])

In [45]: zdata = zlib.compress(randdata, 9)

In [46]: print 'zlib compression ratio: ',1.-1.*len(zdata)/len(data)
zlib compression ratio:  0.5525

In [47]: bz2data = bz2.compress(randdata, 9)

In [48]: print 'bz2 compression ratio: ',1.-1.*len(bz2data)/len(data)
bz2 compression ratio:  0.5975

Surprisingly, the worst case is not too bad at a ~60% compression ratio, but that still leaves roughly 6 GB of compressed lines (about 40% of 15 GB), which is likely to be problematic if you only have 8 GB of memory.

Caleb Hattingh
  • 9,005
  • 2
  • 31
  • 44
0

This problem can be thought of as a problem of efficient memory-page management to reduce swap-file I/O. Let your buffer buf be a list of contiguous chunks of the file that you would like stored into the output file, and let a contiguous chunk be a fixed number of whole lines.

Now, generate a random sequence and remap the returned values to the appropriate chunk numbers and line offsets within those chunks.
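
For example, with a fixed number of lines per chunk, the remapping is just a divmod over a shuffled list of line indices; linecount and the lines_per_chunk value are assumed to be known already:

import random

lines_per_chunk = 10000                       # assumed chunk granularity
order = list(range(linecount))                # one entry per line of the file
random.shuffle(order)
# (chunk number, line offset within that chunk) for every line of the output
access_seq = [divmod(i, lines_per_chunk) for i in order]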

This operation leaves you with a sequence of numbers in [1..number of chunks] which can be described as a sequence of accesses to memory fragments contained in pages numbered [1..number of chunks]. For the online variation (as in a real OS) there is no optimal strategy for this problem, but since you know the actual sequence of page references in advance, there is an optimal solution: always evict the page whose next use is farthest in the future.
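
A minimal sketch of that offline-optimal policy, taking the sequence of chunk numbers from the remapping above and a budget of how many chunks fit in memory at once; plan_chunk_loads is only an illustrative name:

from collections import defaultdict, deque

def plan_chunk_loads(chunk_seq, budget):
    # chunk_seq: the chunk number needed for each output line, in output order
    # budget: how many chunks can be held in memory at the same time
    # Returns how many chunk reads from disk the optimal policy needs.
    next_uses = defaultdict(deque)
    for pos, chunk in enumerate(chunk_seq):
        next_uses[chunk].append(pos)

    loaded, loads = set(), 0
    for chunk in chunk_seq:
        next_uses[chunk].popleft()     # this access is no longer in the future
        if chunk in loaded:
            continue
        if len(loaded) >= budget:
            # Evict the loaded chunk whose next use is farthest away (or never)
            farthest = max(loaded, key=lambda c: next_uses[c][0]
                           if next_uses[c] else float('inf'))
            loaded.remove(farthest)
        loaded.add(chunk)
        loads += 1
    return loads

For example, plan_chunk_loads([chunk for chunk, offset in access_seq], budget=8) tells you how many chunk reads the shuffle would cost with room for 8 chunks in memory.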

What's the gain from this approach? Pages that are going to be used most often are reread from the HDD the least, meaning fewer I/O operations to read the data. Also, provided your chunk size is big enough to keep page swapping low relative to the memory footprint, consecutive lines of the output file will often be taken from a chunk already stored in memory (this one or another not yet swapped back to the drive) rather than reread from the drive.

Maybe it's not the easiest solution (though the optimal page-swapping algorithm is easy to write), but it could be a fun exercise to do, couldn't it?

Maciej Gol
  • 15,394
  • 4
  • 33
  • 51
0

Assuming that disk space is not a problem for you, I create multiple files to hold the data.

import random
import os

PMSize = 100  # lower value means using more primary memory
filenames = ['file' + str(x) for x in range(PMSize)]
shufflers = [open(name, 'w') for name in filenames]

# Scatter each line of the input into a randomly chosen bucket file
with open('filename') as infile:
    for line in infile:
        i = random.randint(0, len(shufflers) - 1)
        shufflers[i].write(line)

for bucket in shufflers:
    bucket.close()

# Concatenate the buckets back over the original file, then delete them
with open('filename', 'w') as outfile:
    for name in filenames:
        with open(name) as bucket:
            outfile.write(bucket.read())
        os.remove(name)

Your memory use will be controlled by PMSize. Time complexity will be around O(N + PMSize).
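
Note that the scatter pass keeps each bucket's lines in their original relative order (see the comments below). Since a single bucket (roughly 150 MB with PMSize = 100 and a 15 GB input) easily fits in memory, shuffling each bucket in RAM before concatenating turns this into a genuinely random shuffle; a rough sketch, with shuffle_buckets_and_merge as an illustrative name:

import random

def shuffle_buckets_and_merge(bucket_names, out_name):
    # Each bucket is small enough to shuffle entirely in memory
    with open(out_name, 'w') as out:
        for name in bucket_names:
            with open(name) as bucket:
                lines = bucket.readlines()
            random.shuffle(lines)
            out.writelines(lines)

Here bucket_names would be the filenames list from the snippet above.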

shshank
  • 2,571
  • 1
  • 18
  • 27
  • This won't actually be randomized; row 1 can only show up as row 1, 101, 201, 301, … in the final version. You need to merge-shuffle the shufflers, and/or need to repeat with different PMSize values each time. (If you randomize the PMSize values within each loop, so each shuffler is a different size, it might improve the practical results, although I don't know how to estimate the theoretical benefits of that.) – abarnert Oct 10 '13 at 19:58
  • PMSize is basically the no. of smaller files created. No, the row one will not always be 1, 101, 201, etc. The first for loop begins writing each line to a random file. The second for loop merges all the files. the third deletes all the small files it created. – shshank Oct 10 '13 at 20:30