Everything is in the title. I'm wondering if anyone knows a quick way, with reasonable memory demands, of randomly shuffling all the lines of a 3 million line file. I guess it is not possible with a simple vim command, so it would have to be a simple script, e.g. in Python. I tried with Python using a random number generator, but did not manage to find a simple way out.
-
You can see [this question](http://stackoverflow.com/questions/1287567/c-is-using-random-and-orderby-a-good-shuffle-algorithm) for some ideas. – alpha-mouse Jan 06 '11 at 18:25
-
1"did not manage to find a simple way out." Really? Please post the code that got too complex. – S.Lott Jan 06 '11 at 18:26
-
Should have said, "did not manage to find a way out". I'm fairly new to Python, so I only know some commands. What I was heading for was putting everything in a vector, choosing a random number between 1 and 3 million, taking out that line, and starting over again with a new random number with an extra condition excluding the previous random numbers. Etc. Hence my question for a simple way (which you and others provided). I'll accept yours as you have the most upvotes. Thanks to everyone though... I learnt a lot! – Nigu Jan 06 '11 at 19:10
-
There is a deeper problem here that is not being addressed: why are you trying to shuffle a file that large? It may be much simpler to create an iterator that pulls shuffled lines out of the file. Unless we know the reason for the shuffling, it's not really possible to give you an answer appropriate to the underlying problem (i.e. a 'good' answer). – arclight Jan 10 '11 at 16:15
-
Check my answer down below. It should be by far the fastest solution, without any Python code, only bash. – Drag0 May 13 '16 at 09:04
11 Answers
Takes only a few seconds in Python:
import random
lines = open('3mil.txt').readlines()
random.shuffle(lines)
open('3mil.txt', 'w').writelines(lines)
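If you would rather draw randomness from the OS instead of the default Mersenne Twister (see the comment discussion below), `random.SystemRandom` supports `shuffle()` as well. A minimal sketch, with the file name assumed from the question:
import random
# Same idea as above, but randomness comes from the OS entropy pool (os.urandom).
rng = random.SystemRandom()
lines = open('3mil.txt').readlines()
rng.shuffle(lines)
open('3mil.txt', 'w').writelines(lines)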

-
It certainly *does* work, and works well. That it can only generate 2**19937 permutations is trivial bordering on irrelevant. Any RNG-based shuffle will have this same "limitation". – John Kugelman Jan 06 '11 at 18:58
-
How is a `sort()`-based solution any better than `shuffle()`? It does not avoid this supposed problem. – John Kugelman Jan 06 '11 at 19:00
-
With the `sort` solution, you can substitute in `os.urandom()` to get a (potentially) much larger period. Although looking through the documentation, I notice there's a `random.SystemRandom()` method, which I suspect supports the `shuffle` method as well. Given that edit, I'll gladly remove the downvote. – Chris B. Jan 06 '11 at 19:07
-
@Chris You are misinterpreting that answer. Not being able to generate all possible permutations is not the same as being unable to randomly shuffle the list at all. I hate to be so argumentative but your cautionary note is a **vast** misunderstanding of the post you link to, and downvoting my answer is flat out wrong. – John Kugelman Jan 06 '11 at 19:09
-
I think I understand it just fine. Using `random.shuffle` with a 3 million line list means some permutations are not possible because you'll hit the period of the RNG. Whether you require all possible permutations to be equally likely depends on the use case. – Chris B. Jan 06 '11 at 19:20
-
@Chris: ...and you haven't yet explained why you think using the same RNG, with the same period, to assign a random key to each element, and then sorting by that key, will make all the other permutations possible... – Karl Knechtel Jan 06 '11 at 19:51
-
It doesn't. But the solution allowed other sources of randomness to be used (namely `os.urandom`) which were potentially better. At the time, I was unaware of `random.SystemRandom` which is a wrapper around the `os.urandom` function, albeit without some methods like `jumpahead()` and `setstate()`. – Chris B. Jan 06 '11 at 20:06
import random
with open('the_file','r') as source:
    data = [ (random.random(), line) for line in source ]
data.sort()
with open('another_file','w') as target:
    for _, line in data:
        target.write( line )
That should do it. 3 million lines will fit into most machines' memory unless the lines are HUGE (over 512 characters).

-
3 million lines with an average of 80 characters per line will be about 240 MB, which is huge for loading a file in memory. – Vikram.exe Jan 06 '11 at 18:31
-
@Vikram.exe. Not really. This machine has 4Gb of memory. 240M is nothing. – S.Lott Jan 06 '11 at 18:33
-
@S.Lott, yeah I agree it's nothing, but I was just wondering if we can do it somehow (with little effort) without loading the whole file in memory. – Vikram.exe Jan 06 '11 at 18:36
-
@Vikram.exe: What's wrong with using memory? That's why we purchased it. – S.Lott Jan 06 '11 at 18:37
-
240M is not that bad given today's memory sizes. The limitation of about 2,000 items for `shuffle` is a more serious problem, although whether the lines need to be truly random is a question. – Chris B. Jan 06 '11 at 18:38
-
Much better. Unless the lines are *huge* (and thus a memory problem), this should work just fine. – Chris B. Jan 06 '11 at 18:55
-
That won't exactly shuffle properly since the sort depends on 2 fields; I had to use `data.sort(key=lambda tup: tup[0])` for it to work – Jimmar Jan 08 '15 at 23:52
-
-1 this method has absolutely no advantage over `random.shuffle` (implemented as a [Fisher-Yates shuffle](http://programmers.stackexchange.com/questions/215737/how-python-random-shuffle-works/215780#215780)), which has better runtime characteristics (`O(n)` instead of `O(n log n)`). – vladr Sep 30 '15 at 16:04
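A quick side check of the "about 2,000 items" figure quoted a few comments up (an editorial aside, not part of any answer): the Mersenne Twister's period is 2**19937 - 1, and n! first exceeds that around n = 2081, so a single shuffle of a longer list cannot reach every possible permutation. A small sketch verifying the bound:
import math
# Find the smallest n with log2(n!) > 19937, i.e. n! > 2**19937.
n = 1
while math.lgamma(n + 1) / math.log(2) <= 19937:
    n += 1
print(n)  # prints 2081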
I just tried this on a file with 4.3M lines, and the fastest thing was the 'shuf' command on Linux. Use it like this:
shuf huge_file.txt -o shuffled_lines_huge_file.txt
It took 2-3 seconds to finish.

-
I guess it depends on the size of the file and the IO speed of your storage. On an SSD via NVMe, with a ~1,000,000-line file and an average of ~4,000 characters per line, it took ~12 seconds for me. – Matteo B. Mar 21 '19 at 10:52
On many systems the `sort` shell command takes `-R` to randomize its input.
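For example, with GNU coreutils (file names here are just placeholders):
sort -R big_file.txt -o shuffled_file.txt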

-
Note that the `-R` option will still sort identical lines together, which may not be the desired behavior. – Chris B. Jan 06 '11 at 18:44
-
`shuf` will randomize lines without regard to equality, and is perhaps the quickest solution – fuzzyTew Jan 06 '11 at 18:49
Here's another version
At the shell, use this.
python decorate.py | sort | python undecorate.py
decorate.py
import sys
import random
for line in sys.stdin:
    sys.stdout.write( "{0}|{1}".format( random.random(), line ) )
undecorate.py
import sys
for line in sys.stdin:
    _, _, data = line.partition("|")
    sys.stdout.write( data )
Uses almost no memory.

-
As posted above, `sort -R` sorts by a random key. Easier than decorating and undecorating the file. – Chris B. Jan 06 '11 at 18:39
-
@Chris B. As you pointed out above, `-R` will still group identical lines. This will not. So if that's the desired behavior, then this is the way to go. – aaronasterling Jan 06 '11 at 18:48
-
As fuzzyTew pointed out above, `shuf` will randomize lines with each permutation equally likely, and doesn't require custom code. That's clearly better than writing and debugging your own program. – Chris B. Jan 06 '11 at 19:10
This is the same as Mr. Kugelman's, but using vim's built-in python interface:
:py import vim, random as r; cb = vim.current.buffer ; l = cb[:] ; r.shuffle(l) ; cb[:] = l

If you do not want to load everything into memory and sort it there, you have to store the lines on disk while doing random sorting. That will be very slow.
Here is a very simple, stupid and slow version. Note that this may take a surprising amount of disk space, and it will be very slow. I ran it with 300,000 lines, and it took several minutes. 3 million lines could very well take an hour. So: Do it in memory. Really. It's not that big.
import os
import tempfile
import shutil
import random
tempdir = tempfile.mkdtemp()
print tempdir
files = []
# Split the lines:
with open('/tmp/sorted.txt', 'rt') as infile:
    counter = 0
    for line in infile:
        outfilename = os.path.join(tempdir, '%09i.txt' % counter)
        with open(outfilename, 'wt') as outfile:
            outfile.write(line)
        counter += 1
        files.append(outfilename)
with open('/tmp/random.txt', 'wt') as outfile:
    while files:
        index = random.randint(0, len(files) - 1)
        filename = files.pop(index)
        outfile.write(open(filename, 'rt').read())
shutil.rmtree(tempdir)
Another version would be to store the lines in an SQLite database and pull them randomly from that database. That is probably going to be faster than this.
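A rough sketch of that SQLite variant, reusing the file names from the script above (the database path is an arbitrary scratch location, and this is illustrative rather than tested at scale):
import sqlite3
# /tmp/lines.db is just a scratch file for this sketch.
conn = sqlite3.connect('/tmp/lines.db')
conn.execute('CREATE TABLE lines (line TEXT)')
with open('/tmp/sorted.txt', 'rt') as infile:
    conn.executemany('INSERT INTO lines (line) VALUES (?)',
                     ((line,) for line in infile))
conn.commit()
with open('/tmp/random.txt', 'wt') as outfile:
    # Let SQLite hand the rows back in a random order.
    for (line,) in conn.execute('SELECT line FROM lines ORDER BY RANDOM()'):
        outfile.write(line)
conn.close()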

-
"That will be very slow"? Slower yes. Very slow is disputable. Each individual step is pretty quick. – S.Lott Jan 06 '11 at 21:05
-
@S.Lott: Well, it depends on the filesystem. I used ext3. 30,000 items took 5.5 seconds. 100,000 items took 16.3 seconds. 200,000 items took 339 seconds. I think the directory lookup gets slow with many items. 3 million items will take *hours*. At least. A database could be reasonably fast, but I can't be bothered to test. :-) Another option would be to read through the file and make an index of the start position of each item, and do seek()s. That should be faster than this. – Lennart Regebro Jan 06 '11 at 22:25
-
Interesting Data. I guess I've spent too long using very large servers. – S.Lott Jan 06 '11 at 22:44
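For completeness, a small sketch of the seek()-based index idea from the comment above, which keeps only one integer per line in memory (file names assumed from the question; untested at scale):
import random
# Record the byte offset of every line, shuffle the offsets, then copy
# the lines out in that random order using seek().
offsets = []
with open('3mil.txt', 'rb') as f:
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)
random.shuffle(offsets)
with open('3mil.txt', 'rb') as src, open('shuffled.txt', 'wb') as dst:
    for pos in offsets:
        src.seek(pos)
        dst.write(src.readline())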
Here is another way using random.choice; this may provide some gradual memory relief as well, but with a worse Big-O :)
from random import choice
with open('data.txt', 'r') as r:
    lines = r.readlines()
with open('shuffled_data.txt', 'w') as w:
    while lines:
        l = choice(lines)
        lines.remove(l)
        w.write(l)

-
"a better Big-O" <- Unfortunately not :-(. The repeated removal in `lines.remove(l)` gives your algorithm a running time that's quadratic in the number of lines. It'll be unusable (running time of hours to days) for a 3 million line file. – Mark Dickinson Feb 11 '18 at 09:54
This is not strictly necessary for your problem; I'm keeping it here for people who come here seeking a solution for shuffling a bigger file, but it will work for smaller files as well. Change `split -b 1GB` to a smaller size, e.g. `split -b 100MB`, to make a lot of text files of 100MB each.
I had a 20GB file containing more than 1.5 billion sentences in it. Calling the `shuf` command in the Linux terminal simply overwhelmed both my 16GB of RAM and an equally large swap area. This is a bash script I wrote to get the job done. It assumes that you keep the bash script in the same folder as your big text file.
#!/bin/bash
#Create a temporary folder named "splitted"
mkdir ./splitted
#Split the input file into multiple small (1GB each) files
#This will help us shuffle the data
echo "Splitting big txt file..."
split -b 1GB ./your_big_file.txt ./splitted/file --additional-suffix=.txt
echo "Done."
#Shuffle the small files
echo "Shuffling splitted txt files..."
for entry in "./splitted"/*.txt
do
  shuf "$entry" -o "$entry"
done
echo "Done."
#Concatenate the split, shuffled files into one big text file
echo "Concatenating shuffled txt files into 1 file..."
cat ./splitted/* > ./your_big_file_shuffled.txt
echo "Done"
#Delete the temporary "splitted" folder
rm -rf ./splitted
echo "Complete."

The following Vimscript can be used to swap lines:
function! Random()
    let nswaps = 100
    let firstline = 1
    let lastline = 10
    let i = 0
    while i <= nswaps
        exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
        exe line.'d'
        exe "let line = system('shuf -i ".firstline."-".lastline." -n 1')[:-2]"
        exe "normal! " . line . 'Gp'
        let i += 1
    endwhile
endfunction
Select the function in visual mode and type `:@"`, then execute it with `:call Random()`

This will do the trick: my solution doesn't even use random, and it will also remove duplicates.
import sys
lines= list(set(open(sys.argv[1]).readlines()))
print(''.join(lines))
in the shell
python shuffler.py nameoffilestobeshuffled.txt > shuffled.txt

-
While set does not preserve the input order, the order is far from random and must not be relied upon. – Pyfisch Feb 08 '20 at 20:28