I'm working with a large CSV file. How can I take a random sample of rows, say 200 in total, and recombine them into a new CSV with the same structure as the original?

SWeko
Joe Mornin

3 Answers

The procedure I would use is as follows:

  1. Generate 200 unique random numbers between 0 and the number of lines in the CSV file.
  2. Read each line of the CSV file, keeping track of the line number you are on. If that line number matches one of the numbers generated above, output the line (see the sketch after this list).
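
A minimal Python sketch of this two-pass approach; the filenames input.csv and sample.csv are placeholders, and the header row is assumed to be copied separately so the sample keeps the original structure:

import random

# Pass 1: count the data rows (the header is handled separately).
with open('input.csv') as f:
    f.readline()  # skip the header
    num_rows = sum(1 for _ in f)

# Generate 200 unique random row numbers.
chosen = set(random.sample(range(num_rows), 200))

# Pass 2: copy the header plus the chosen rows.
with open('input.csv') as src, open('sample.csv', 'w') as dst:
    dst.write(src.readline())  # header keeps the structure intact
    for i, line in enumerate(src):
        if i in chosen:
            dst.write(line)
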
Lawrence Woodman

Use the reservoir sampling technique, which does not require all records to be in memory or the total number of records to be known in advance. With it, you stream in your records one by one and probabilistically select them into the sample. Once the stream is exhausted, output the final sample records. The technique guarantees that each record in the stream has the same probability of being in the final sample; that is to say, it generates a simple random sample. A sketch follows below.
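
Here is a minimal sketch of reservoir sampling (the classic Algorithm R) in Python; the filenames and the sample size of 200 are assumptions, and the header is copied aside so the output keeps the original structure:

import random

def reservoir_sample(stream, k):
    """Keep a simple random sample of k items from a stream (Algorithm R)."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

with open('input.csv') as f, open('sample.csv', 'w') as out:
    out.write(f.readline())  # copy the header
    out.writelines(reservoir_sample(f, 200))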

Brent Worden

You can use the random module's random.sample method to pick random line offsets from a list, as shown below.

import random

# Fetching line offsets.
# Courtesy: Adam Rosenfield's tip about how to read a HUGE text file.
# http://stackoverflow.com/questions/620367/

# Read the file once and build a list of line offsets.
# Binary mode keeps len(line) in bytes, so the offsets are
# valid arguments for seek().
line_offset = []
offset = 0
with open('your_file', 'rb') as f:
    for line in f:
        line_offset.append(offset)
        offset += len(line)

# Part where you pick the random lines and copy them to your new file
# My 2 cents.
randoffsets = random.sample(line_offset, 200)

with open('your_file', 'rb') as f, open('your_new_file', 'wb') as out:
    for k in randoffsets:
        f.seek(k)
        out.write(f.readline())  # append the sampled line to the new file

You could try linecache if it works for you, but since linecache reads the entire file into memory, I'm not sure how well it would work for a 6 GB file.

Praveen Gollakota