I'm working with a large CSV file. How can I take a random sample of rows, say 200 in total, and recombine them into a new CSV with the same structure as the original?

SWeko
Joe Mornin

3 Answers

The procedure I would use is as follows:

  1. Generate 200 unique random numbers between 0 and the number of lines in the CSV file.
  2. Read each line of the CSV file, keeping track of the line number you are on. If that line number matches one of the numbers generated above, output the line (see the sketch after this list).
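
A minimal Python sketch of this two-pass approach; the filenames input.csv and sample.csv are placeholders, and the header row is assumed to be copied separately so the sample keeps the original structure:

import random

# Pass 1: count the data rows (the header is handled separately).
with open('input.csv') as f:
    f.readline()  # skip the header
    num_rows = sum(1 for _ in f)

# Generate 200 unique random row numbers.
chosen = set(random.sample(range(num_rows), 200))

# Pass 2: copy the header plus the chosen rows.
with open('input.csv') as src, open('sample.csv', 'w') as dst:
    dst.write(src.readline())  # header keeps the structure intact
    for i, line in enumerate(src):
        if i in chosen:
            dst.write(line)
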
Lawrence Woodman

Use the reservoir sampling technique, which does not require all records to be in memory or the total number of records to be known in advance. With it, you stream in your records one by one and probabilistically select them into the sample. Once the stream is exhausted, output the final sample records. The technique guarantees that each record in the stream has the same probability of being in the final sample; that is to say, it generates a simple random sample. A sketch follows below.
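
Here is a minimal sketch of reservoir sampling (the classic Algorithm R) in Python; the filenames and the sample size of 200 are assumptions, and the header is copied aside so the output keeps the original structure:

import random

def reservoir_sample(stream, k):
    """Keep a simple random sample of k items from a stream (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

with open('input.csv') as f, open('sample.csv', 'w') as out:
    out.write(f.readline())  # copy the header
    out.writelines(reservoir_sample(f, 200))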

Brent Worden

You can use the random module's random.sample method to pick random line offsets from a list, as shown below.

import random

# Fetching line offsets.
# Courtesy: Adam Rosenfield's tip about how to read a HUGE text file.
# http://stackoverflow.com/questions/620367/

# Read the file once and build a list of line offsets.
# Binary mode keeps len(line) in bytes, so the offsets are
# valid arguments for seek().
line_offset = []
offset = 0
with open('your_file', 'rb') as f:
    for line in f:
        line_offset.append(offset)
        offset += len(line)

# Part where you pick the random lines and copy them to your new file
# My 2 cents.
randoffsets = random.sample(line_offset, 200)

with open('your_file', 'rb') as f, open('your_new_file', 'wb') as out:
    for k in randoffsets:
        f.seek(k)
        out.write(f.readline())  # append the sampled line to the new file

You could try linecache if it works for you, but since linecache reads the entire file into memory, I'm not sure how well it would work for a 6 GB file.

Praveen Gollakota