Extracting a random line in a file without loading the file into RAM in python

Question

I have big svmlight files that I'm using for machine learning purpose. I'm trying to see if a sumsampling of those files would lead to good enough results.

I want to extract random lines of my files to feed them into my models but I want to load the less possible information in RAM.

I saw here (Read a number of random lines from a file in Python) that I could use linecache but all the solution end up loading everything in memory.

Could someone give me some hints? Thank you.

EDIT : forgot to say that I know the number of lines in my files beforehand.

score 6 · Accepted Answer · answered Mar 02 '14 at 17:07

6

You can use a heapq to select n records based on a random number, eg:

import heapq
import random

SIZE = 10
with open('yourfile') as fin:
    sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())

This is remarkably efficient as the heapq remains fixed size, it doesn't require a pre-scan of the data and elements get swapped out as other elements get chosen instead - so at most you'll end up with SIZE elements in memory at once.

answered Mar 02 '14 at 17:07

Jon Clements

138,671
33
247
280

I forgot to say that I know the number of lines in my files, wouldn't I be able to take advantage of that? – AdrienNK Mar 02 '14 at 17:16
@AdrienNK the only advantage would be if your lines were of fixed length such that you could seek directly to each record after generating *n* integers between 0 and *lines in file*... Otherwise, it offers no advantage... – Jon Clements Mar 02 '14 at 17:18
Ok, thank you for your answer. I'm still not sure exactly what is considered as a record but I'll look into that! – AdrienNK Mar 02 '14 at 17:22
I thought the intention was to avoid reading the whole file. Won't that I/O time tend to dominate if the files are large enough that loading them into memory is an issue? – holdenweb Mar 03 '14 at 06:59
@holdenweb I don't know... I was working on the *but I want to load the less possible information in RAM.* and this does that (and for all we know - they're running off SSDs - so i/o time isn't necessarily such a burden)... I have mentioned that seeking is practical if lines are of a fixed length, and I just noticed that you've introduced an example of seek and find a line for where they're not. I think if performance becomes an issue, then it's down to the OP to try alternate solutions based on their specific circumstance. – Jon Clements Mar 03 '14 at 07:08
Indeed, and the more alternatives to choose from the better. Having studied your solution and considered my own again I added a comment reminding people that my suggested method isn't Unicode-safe. – holdenweb Mar 03 '14 at 07:12

score 3 · Answer 2 · answered Mar 02 '14 at 17:34

One option is to do a random seek into the file then look backwards for a newline (or the start of the file) before reading a line. Here's a program that prints a random line of each of the Python programs it finds in the current directory.

import random
import os
import glob

for name in glob.glob("*.py"):
    mode, ino, den, nlink, uid, gid, size, atime,  mtime, ctime = os.stat(name)
    inf = open(name, "r")
    location = random.randint(0, size)
    inf.seek(location)
    while location > 0:
        char = inf.read(1)
        if char == "\n":
            break
        location -= 1
        inf.seek(location)
    line = inf.readline()
    print name, ":", line[:-1]

As long as the lines aren't huge this shouldn't be unduly burdensome.

This solution is fine (in my opinion, anyway) in an ASCII world, but the vagaries of multi-byte variable-length encodings mean that it cannot be considered reliable when Unicode is in use. This code is also not portable to Python 3 as presented. — holdenweb, Mar 03 '14 at 07:02

score 1 · Answer 3 · edited May 23 '17 at 10:29

1

You could scan the file once, counting the number of lines. Once you know that, you can generate the random line number, re-read the file and emit that line when you see it.

Actually since you're interested in multiple lines, you should look at Efficiently selecting a set of random elements from a linked list.

edited May 23 '17 at 10:29

Community

1
1

answered Mar 02 '14 at 17:05

sjr

9,769
1
25
36

Wouldn't re-reading the file be extremely unefficient with big files? (I'm talking about >10Gb files) – AdrienNK Mar 02 '14 at 17:16
ya in that case you should use the reservoir sampling technique probably – sjr Mar 02 '14 at 19:42

Extracting a random line in a file without loading the file into RAM in python

3 Answers3