I need to open a csv file, select 1000 random rows and save those rows to a new file. I'm stuck and can't see how to do it. Can anyone help?

nancyh

2 Answers

22

There are two parts to this problem: first, getting every row of your CSV; second, randomly sampling from those rows. I would suggest constructing your list of rows with a list comprehension, something along the lines of:

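# Python 3: open in text mode; newline="" leaves the line endings untouched
with open("your_file.csv", "r", newline="") as source:
    lines = [line for line in source]  # each line keeps its trailing newline

Once you've got that, you want to take a random sample of those lines. Luckily, Python has a function that does just that: random.sample.

import random
# raises ValueError if the file has fewer than 1000 lines
random_choice = random.sample(lines, 1000)

Once you've got those lines, you want to write them back to a new file (though I assume you already know how, given that a quick google reveals this), so I will include an example just for completeness's sake:

with open("new_file.csv", "w", newline="") as sink:
    # the lines already end in a newline, so no "\n".join is needed
    sink.writelines(random_choice)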

This just writes your sampled lines to the file of your choice. It's also worth noting that nothing here is specific to CSV: the file is treated as just another file with some lines. That works as long as no quoted field contains a line break. Keep in mind that a header row, if there is one, gets sampled like any other line.
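
If the file does have a header row, or quoted fields that span several lines, the csv module can do the reading and writing instead. What follows is only a minimal sketch under those assumptions: the function name sample_csv and its parameters are made up for illustration, the first row is assumed to be a header, and the sample size is capped at the number of rows available so that random.sample does not raise ValueError.

```python
import csv
import random


def sample_csv(source_path, sink_path, k=1000, seed=None):
    """Copy the header plus up to k randomly chosen data rows to sink_path.

    Assumes the first row is a header. Returns the number of data rows written.
    """
    rng = random.Random(seed)  # a seed makes the sample reproducible

    with open(source_path, "r", newline="") as source:
        reader = csv.reader(source)
        header = next(reader, None)  # None if the file is empty
        rows = list(reader)

    # random.sample raises ValueError if k exceeds the population size,
    # so cap k at the number of rows actually read.
    chosen = rng.sample(rows, min(k, len(rows)))

    with open(sink_path, "w", newline="") as sink:
        writer = csv.writer(sink)
        if header is not None:
            writer.writerow(header)
        writer.writerows(chosen)
    return len(chosen)
```

Calling sample_csv("your_file.csv", "new_file.csv") would then write the header followed by 1000 random rows, or every row if the file is shorter than that.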

If you're working with a very large file, or are concerned about taking up too much memory, you should replace the above list comprehension with a generator and sample from that instead. That process isn't nearly as straightforward, because random.sample needs a sequence whose length it knows up front. If you want advice on doing that efficiently, you should look at this question: "Python random sample with a generator iterable iterator".
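
One common way to sample from an iterator without loading everything is reservoir sampling (Algorithm R). This is a technique swapped in here for illustration and is not necessarily what the linked question recommends. The sketch below uses a made-up function name, reservoir_sample, and holds only k items in memory at any time.

```python
import random


def reservoir_sample(iterable, k, rng=random):
    """Return k items chosen uniformly at random from iterable (Algorithm R).

    Only k items are held in memory at a time. If the iterable yields fewer
    than k items, all of them are returned.
    """
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (index + 1) by drawing
            # a slot in [0, index] and replacing only if it falls below k.
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item
    return reservoir
```

Since an open file iterates line by line, passing the file object directly, as in reservoir_sample(source, 1000) inside the with block, streams the file instead of building the full list. The order of the returned lines carries no meaning.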

Slater Victoroff
  • I tried this and got the following error message. Traceback (most recent call last): File "random.py", line 41, in import random File "/auto/data/nhine/Python/random.py", line 42, in random_choice = random.sample(lines, 1000) AttributeError: 'module' object has no attribute 'sample' – nancyh Aug 19 '13 at 13:42
  • You've got a namespace error. Don't name your file random.py – Slater Victoroff Aug 19 '13 at 13:44
  • I have now got the code to run ( I had called my file random.py and that was causing problems), but it doesn't seem to be producing an output file. – nancyh Aug 19 '13 at 13:54
-1

The basic procedure is this:

1. Open the input file

This can be accomplished with the basic builtin open function.

2. Open the output file

You'll probably use the same method that you chose in step #1, but you'll need to open the file in write mode.

3. Read the input file to a variable

It's often preferable to read the file one line at a time, and operate on that one line before reading the next, but if memory is not a concern, you can also read the entire thing into a variable all at once.

4. Choose selected lines

There will be any number of ways to do this, depending on how you did step #3, and your requirements. You could use filter, or a list comprehension, or a for loop with an if statement, etc. The best way depends on the particular constraints of your goal.

5. Write the selected lines

Take the selected lines you've chosen in step #4 and write them to the file.

6. Close the files

It's generally good practice to close the files you've opened to prevent resource leaks.
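
The six steps above can be sketched roughly as follows. This is only one possible reading: the function name copy_random_lines is made up, step 4 is assumed to mean random sampling with random.sample (as the question asks), and the whole file is read into memory at once. The files are closed explicitly to mirror step 6, although a with block would do that automatically.

```python
import random


def copy_random_lines(in_path, out_path, k):
    """Write up to k randomly chosen lines of in_path to out_path."""
    source = open(in_path, "r", newline="")        # 1. open the input file
    try:
        sink = open(out_path, "w", newline="")     # 2. open the output file
        try:
            lines = source.readlines()             # 3. read the input
            # 4. choose lines; cap k so random.sample cannot raise ValueError
            chosen = random.sample(lines, min(k, len(lines)))
            sink.writelines(chosen)                # 5. write the selection
        finally:
            sink.close()                           # 6. close the files
    finally:
        source.close()
    return len(chosen)
```

For example, copy_random_lines("your_file.csv", "new_file.csv", 1000) returns the number of lines it wrote.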

Brionius
  • The csv module does not open files, nor is it complicated. Also in python you should never have to explicitly close a file since the `with` syntax is so powerful. Also OP is looking for randomly selected lines, not a filter. – Slater Victoroff Aug 19 '13 at 13:41
  • After looking at the docs, you're right about csv not opening files directly - haven't used it myself. I guess "complicated" is pretty subjective, but sure. As for `with`, it internally closes the file. If the OP decides to use `with`, he'll be doing that anyways. And as for the random part, I interpreted "random" in a colloquial sense. If he really meant he's going to sample them using a pseudorandom generator, then I misunderstood. – Brionius Aug 19 '13 at 13:44
  • Using a filter to randomly sample is extremely inefficient, unintuitive, and generally difficult to read if you can even get it working. There's a difference between a module internally closing a file and directly calling the close method, and a confusion between the two can lead to all kinds of silly errors (like IOErrors closing already closed files.) I think this answer would be great for another question, but it doesn't seem to take the OP's question, or python into account. – Slater Victoroff Aug 19 '13 at 13:51