
OK, so I have multiple text files, each containing well over 500,000 or even 1,000,000 lines.

Currently I do something like this:

import random

def line_function(line):
    # Do something with the given line
    pass

def random_iteration(filepath):
    with open(filepath) as f:
        lines = f.readlines()
        random.shuffle(lines)
        for line in lines:
            result = line_function(line)

The thing is that the Python docs on `random.shuffle()` clearly state (emphasis added by me):

Note that even for small len(x), the total number of permutations of x can quickly grow larger than the period of most random number generators. *This implies that most permutations of a long sequence can never be generated.* For example, a sequence of length 2080 is the largest that can fit within the period of the Mersenne Twister random number generator.

So the question is:

What would be the fastest and most efficient way to make my setup work as intended?

Further info:

There is a reason why I want to apply `line_function()` to the lines in a random order and not simply iterate over them in the sequence they are in. Also note that I highly prefer to process each line only once.

Finally, shuffling the text file up front, or dividing it into smaller files, unfortunately isn't an option, and isn't what I am asking about.


Any insights are more than welcome! Thanks in advance, guys.

Montmons
  • You only need one random permutation, not all of them. If you only need a "random" permutation, the quoted bit is not a problem. If you need a truly random permutation, that's a bigger struggle. – LangeHaare Feb 27 '18 at 17:32
  • The note is true about any way that you select things randomly from such a large set of inputs. – Barmar Feb 27 '18 at 17:32
  • Is the file static, or does it change frequently? If it's static, you can create another file holding the positions of the start of each line in the file. Read that file, shuffle it, then you can seek in the main file to those indexes. – Barmar Feb 27 '18 at 17:34
  • The note you quoted has nothing to do with efficiency, so how is it relevant to your main question? – Barmar Feb 27 '18 at 17:36
  • @Barmar - you're right about efficiency, so I edited the title of the question. And thanks for the tip on how to handle this in case of static files. I have multiple files, but in themselves they are indeed static, so I might try something along the lines of your advice. – Montmons Feb 27 '18 at 17:41
  • Just ignore the "This implies that most permutations of a long sequence can never be generated." sentence. It has essentially _no_ practical implication for real-world code. – Mark Dickinson Feb 27 '18 at 17:43
  • You could always use numpy to just choose all your indices beforehand and iterate through the index list. You're reading the whole file in memory anyway, but not sure I understand the issue – roganjosh Feb 27 '18 at 17:44
  • I don't understand your question, why doesn't loading the file into a list then using `random.shuffle` on the list work for you? Aside from the fact that you are doing `shuffled_lines = random.shuffle(lines)` which will return `None`... – juanpa.arrivillaga Feb 27 '18 at 17:44
  • @roganjosh why would you choose indices, why not just *shuffle the list of lines*? How would that be faster? Note, I do believe that the `numpy` implementation of `shuffle` is indeed faster, so that is a good suggestion if speed is critical. – juanpa.arrivillaga Feb 27 '18 at 17:57
  • Have you run into a performance issue with this approach, or are you just anticipating one? If you have had a performance issue, is it related to being low on memory? – Steven Rumbalski Feb 27 '18 at 18:13
  • @StevenRumbalski No, I hadn't run into one yet, although my program does seem to slow down a bit when it reaches the iteration function. This led me to the Python docs, and after reading the info there I thought I most likely wasn't taking the best approach, but I couldn't really think of a better solution myself, so I posted the question here. – Montmons Feb 27 '18 at 18:18
  • @Montmons why do you think you aren't taking the best approach? Again, you haven't really articulated *what the problem is*. – juanpa.arrivillaga Feb 27 '18 at 18:19

3 Answers


As Mark Dickinson says, the doc line you are quoting has essentially no practical implications for real-world code. It definitely doesn't have any relevance for your code.

It doesn't matter whether the shuffle produces a truly uniform random distribution over all possible permutations. What matters is whether the shuffle is distinguishable from such a distribution, up to some standard of distinguishability. random.shuffle is statistically indistinguishable from a perfectly random shuffle up to the quality of the underlying Mersenne Twister algorithm, and the ways in which it is distinguishable have nothing to do with the period.

You don't need to do anything special to make your setup "work as intended". random.shuffle already works.
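
If you want to convince yourself, a cheap sanity check is easy to run (the data below is illustrative):

import random

# A million short strings, roughly the scale in the question.
lines = ["line %d\n" % i for i in range(1000000)]

# random.shuffle rearranges the list's references in place; it never
# copies the strings, so even a million entries shuffle in about a
# second or less on typical hardware. Note that it returns None
# rather than a new list.
random.shuffle(lines)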

user2357112
  • Ok, I have to admit that in hindsight my question was a bit of a non-question. I sensed my program slowed down just a bit when having to shuffle the big file, went into the Python docs, discovered the quoted doc line, misinterpreted it, and thought that I wasn't taking the right approach. Your answer did clear things up, so thanks! – Montmons Feb 28 '18 at 11:26

You're going to have trouble doing this "quickly and efficiently" in Python, but if you must, the place to start is a shuffling algorithm such as Fisher-Yates.

Once you implement that, load your files and record the byte offset at which each line starts. Shuffle that array of offsets, open your files, then iterate over the array, seeking to each offset and reading from there to the next newline.

With datasets as large as you're proposing, it's reasonable to expect that `lines = f.readlines()` will simply create too much memory pressure, demanding a more complex but more scalable solution using offsets.

For more efficient reruns, consider also saving the offset metadata once it's generated, so you don't need to walk over the whole file (or files) every time.
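
A minimal sketch of that approach, assuming `line_function` from the question (the helper name and the `.offsets` cache filename are illustrative, and the file is assumed to be UTF-8):

import os
import pickle
import random

def build_offsets(filepath):
    # One pass in binary mode, recording the byte offset at which
    # each line starts.
    offsets = []
    position = 0
    with open(filepath, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets

def random_iteration(filepath):
    cache_path = filepath + ".offsets"  # illustrative metadata filename
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            offsets = pickle.load(f)
    else:
        offsets = build_offsets(filepath)
        with open(cache_path, "wb") as f:
            pickle.dump(offsets, f)  # saved once, reused on reruns

    random.shuffle(offsets)  # random.shuffle is itself a Fisher-Yates shuffle
    with open(filepath, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            result = line_function(f.readline().decode("utf-8"))

This keeps only one line in memory at a time, at the cost of a seek per line.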

Adam Barnes

I'd rather shuffle a list of integers than the huge lines themselves (the integers being the index/position of each line in the list of lines). Something like this:

import random

def line_function(line):
    # Do something with the given line
    pass

def random_iteration(filepath):
    with open(filepath) as f:
        lines = f.readlines()
        count = len(lines)
        # Sample every index exactly once, i.e. a shuffled range of
        # indexes. (It must be range(count), not range(count + 1),
        # or the sample could include an out-of-range index.)
        random_index_list = random.sample(range(count), count)
        for index in random_index_list:
            result = line_function(lines[index])
murphy1310
  • He wants to process *all* lines in a random order, not just one random line. – Barmar Feb 27 '18 at 17:42
  • Tried this one. The problem here is that it is hard to avoid processing the same line twice, which is not what I want. I tried avoiding it by keeping a list of the used index numbers and then checking against that list, but that seemed neither efficient nor fast. – Montmons Feb 27 '18 at 17:44
  • "I'd rather do a shuffle on a list of integers than the huge lines." Why? What advantage does that give you? – juanpa.arrivillaga Feb 27 '18 at 17:45
  • Shuffling a list of strings is really just shuffling a list of pointers, it doesn't copy the strings themselves. But making a list of indexes beforehand means you don't need to read the whole file into memory at all. – Barmar Feb 27 '18 at 17:47
  • Hmm, after the edits this looks quite promising... I'm gonna have a go at this one :). I'll report back later. – Montmons Feb 27 '18 at 17:51
  • @Montmons **why does this look promising**? What advantage does this approach give you? Not to mention, this is incorrect, since `random_index_list = random.shuffle(list(xrange(count)))` will return `None` – juanpa.arrivillaga Feb 27 '18 at 17:53
  • @juanpa.arrivillaga let's find out – murphy1310 Feb 27 '18 at 17:53
  • @Barmar: To know the number of lines in a file you would need to read the file, so that would violate the double-pass rule (I am assuming the lines are not fixed-width). – Steven Rumbalski Feb 27 '18 at 17:53
  • @murphy1310 dude, it will *have* to be slower than just shuffling the list of lines, because then you won't have to index into the list, and will just iterate over the shuffled list directly. – juanpa.arrivillaga Feb 27 '18 at 17:54
  • @StevenRumbalski As I mentioned in a comment below the question, since the files are static you can create another file containing line indexes. This is a one-time cost that doesn't have to be incurred each time you shuffle. – Barmar Feb 27 '18 at 17:57
  • @Barmar: I saw the restriction in the question "dividing it into smaller files unfortunately isn't an option" and figured all preprocessing was out. Your suggested approach is indeed a memory saver, but I remain a little skeptical. If the files average 100 characters per line a python list of 1 million lines should be about 120mb (on 32 bit). The naive approach may perform just fine. I wouldn't recode until I ran into a performance problem. Managing random seeks will be a slowdown that is only worth it if it avoids memory issues. – Steven Rumbalski Feb 27 '18 at 18:06
  • Well, tested it and it is indeed slower... so I'll try @Barmar's solution of creating additional files to hold the line indexes. I'll report my findings later. I didn't expect the question to draw as much attention as it did. I was hoping there might be a simple solution (e.g. another shuffle option from a pip package or something) that I didn't know about. But I'm glad to have received some knowledgeable answers. – Montmons Feb 27 '18 at 18:11
  • @Montmons dude, **what** is it about `random.shuffle` that doesn't work for you? Is it too slow? – juanpa.arrivillaga Feb 27 '18 at 18:14