1

I have big svmlight files that I'm using for machine learning purpose. I'm trying to see if a sumsampling of those files would lead to good enough results.

I want to extract random lines of my files to feed them into my models but I want to load the less possible information in RAM.

I saw here (Read a number of random lines from a file in Python) that I could use linecache but all the solution end up loading everything in memory.

Could someone give me some hints? Thank you.

EDIT : forgot to say that I know the number of lines in my files beforehand.

Community
  • 1
  • 1
AdrienNK
  • 850
  • 8
  • 19

3 Answers3

6

You can use a heapq to select n records based on a random number, eg:

import heapq
import random

SIZE = 10
with open('yourfile') as fin:
    sample = heapq.nlargest(SIZE, fin, key=lambda L: random.random())

This is remarkably efficient as the heapq remains fixed size, it doesn't require a pre-scan of the data and elements get swapped out as other elements get chosen instead - so at most you'll end up with SIZE elements in memory at once.

Jon Clements
  • 138,671
  • 33
  • 247
  • 280
  • I forgot to say that I know the number of lines in my files, wouldn't I be able to take advantage of that? – AdrienNK Mar 02 '14 at 17:16
  • @AdrienNK the only advantage would be if your lines were of fixed length such that you could seek directly to each record after generating *n* integers between 0 and *lines in file*... Otherwise, it offers no advantage... – Jon Clements Mar 02 '14 at 17:18
  • Ok, thank you for your answer. I'm still not sure exactly what is considered as a record but I'll look into that! – AdrienNK Mar 02 '14 at 17:22
  • I thought the intention was to avoid reading the whole file. Won't that I/O time tend to dominate if the files are large enough that loading them into memory is an issue? – holdenweb Mar 03 '14 at 06:59
  • @holdenweb I don't know... I was working on the *but I want to load the less possible information in RAM.* and this does that (and for all we know - they're running off SSDs - so i/o time isn't necessarily such a burden)... I have mentioned that seeking is practical if lines are of a fixed length, and I just noticed that you've introduced an example of seek and find a line for where they're not. I think if performance becomes an issue, then it's down to the OP to try alternate solutions based on their specific circumstance. – Jon Clements Mar 03 '14 at 07:08
  • Indeed, and the more alternatives to choose from the better. Having studied your solution and considered my own again I added a comment reminding people that my suggested method isn't Unicode-safe. – holdenweb Mar 03 '14 at 07:12
3

One option is to do a random seek into the file then look backwards for a newline (or the start of the file) before reading a line. Here's a program that prints a random line of each of the Python programs it finds in the current directory.

import random
import os
import glob

for name in glob.glob("*.py"):
    mode, ino, den, nlink, uid, gid, size, atime,  mtime, ctime = os.stat(name)
    inf = open(name, "r")
    location = random.randint(0, size)
    inf.seek(location)
    while location > 0:
        char = inf.read(1)
        if char == "\n":
            break
        location -= 1
        inf.seek(location)
    line = inf.readline()
    print name, ":", line[:-1]

As long as the lines aren't huge this shouldn't be unduly burdensome.

holdenweb
  • 33,305
  • 7
  • 57
  • 77
  • 1
    This solution is fine (in my opinion, anyway) in an ASCII world, but the vagaries of multi-byte variable-length encodings mean that it cannot be considered reliable when Unicode is in use. This code is also not portable to Python 3 as presented. – holdenweb Mar 03 '14 at 07:02
1

You could scan the file once, counting the number of lines. Once you know that, you can generate the random line number, re-read the file and emit that line when you see it.

Actually since you're interested in multiple lines, you should look at Efficiently selecting a set of random elements from a linked list.

Community
  • 1
  • 1
sjr
  • 9,769
  • 1
  • 25
  • 36