6

What I would like to do is shuffle the rows (read from a CSV), then print the first 10,000 randomized rows to one CSV and the remainder to a separate CSV. With a smaller file I can do something like

java.util.Collections.shuffle(...)
for (int i=0; i < 10000; i++) printcsv(...)
for (int i=10000; i < data.length; i++) printcsv(...)

However, with very large files I now get an OutOfMemoryError.

Prince John Wesley
deltanovember
  • You could memory map the file and read parts of the file. – Thomas Oct 24 '11 at 12:18
  • Sounds like you need more memory. :-) – nfechner Oct 24 '11 at 12:18
  • @Thomas I don't think that's the problem. The poster needs to hold all entries in memory if he wants to randomize them before writing them to a file. – nfechner Oct 24 '11 at 12:19
  • I assume you have much, much more than 10,000 rows; otherwise you shouldn't be getting an out of memory error. – Peter Lawrey Oct 24 '11 at 13:07
  • Exactly how long do these rows tend to be? Keeping them in memory in some compressed format might reduce the memory footprint quite a bit. But it won't scale indefinitely, of course. – G_H Oct 24 '11 at 13:36
  • How many rows can you handle before getting the exception? What's the memory usage at that point? "A few million": is that 1-10m or 10-100m? – Stefan Oct 24 '11 at 17:17

6 Answers

3

You could:

  • Use more memory or

  • Shuffle not the actual CSV rows, but only a collection of row numbers, and then read the input file line by line (buffered, of course) and write each line to one of the desired output files (see the sketch below).
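
A minimal sketch of the second bullet, assuming the list of row numbers (though not the rows themselves) fits in memory; the file names and the sample size are placeholders:

import java.io.*;
import java.util.*;

public class SplitByShuffledRowNumbers {
    public static void main(String[] args) throws IOException {
        File input = new File("input.csv");     // placeholder path
        int sampleSize = 10000;

        // First pass: count the lines without keeping any of them in memory.
        long total = 0;
        BufferedReader counter = new BufferedReader(new FileReader(input));
        while (counter.readLine() != null) total++;
        counter.close();

        // Shuffle only the row numbers; the first 10,000 of them become the "chosen" rows.
        List<Long> rowNumbers = new ArrayList<Long>();
        for (long i = 0; i < total; i++) rowNumbers.add(i);
        Collections.shuffle(rowNumbers);
        Set<Long> chosen = new HashSet<Long>(
                rowNumbers.subList(0, (int) Math.min(sampleSize, total)));

        // Second pass: stream each line straight to one of the two output files.
        BufferedReader reader = new BufferedReader(new FileReader(input));
        PrintWriter first = new PrintWriter(new FileWriter("first10000.csv"));
        PrintWriter rest = new PrintWriter(new FileWriter("remainder.csv"));
        String line;
        for (long lineNo = 0; (line = reader.readLine()) != null; lineNo++) {
            (chosen.contains(lineNo) ? first : rest).println(line);
        }
        reader.close();
        first.close();
        rest.close();
    }
}

As noted in the comments, each output file then keeps the original relative order; since the first file is much smaller, a follow-up in-memory shuffle of it is feasible.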

michael667
  • This approach would preserve the original ordering. – Nicola Musatti Oct 24 '11 at 12:35
  • Good point. I thought the task was only to choose a random sample of all rows. However, if your assumption is correct, a subsequent in-memory shuffle of the output file might be possible, as it is much smaller than the original file. – michael667 Oct 24 '11 at 13:15
  • There are 2 output files, and if he is correct then you have the same problem with the second file, if the number of lines he can handle in memory is much smaller than the total number of lines. – Stefan Oct 24 '11 at 15:48
2

You could memory-map the file and find all the newlines, storing their positions in an array of int or long. Then create an array of int indexes and shuffle those. This should use about 8-32 bytes per line. If that doesn't fit into memory, you can use memory-mapped files for these arrays as well.
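
A rough sketch of the newline-indexing part, assuming a hypothetical input.csv and a file that fits in a single mapping (a MappedByteBuffer is limited to roughly 2 GB, so larger files need several mappings):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NewlineIndex {
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("input.csv", "r");  // placeholder path
        FileChannel channel = raf.getChannel();
        int size = (int) channel.size();   // a single mapping only works below ~2 GB
        MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);

        // Record where each line starts by scanning for '\n'.
        List<Integer> lineStarts = new ArrayList<Integer>();
        lineStarts.add(0);
        for (int pos = 0; pos < size; pos++) {
            if (buffer.get(pos) == '\n' && pos + 1 < size) {
                lineStarts.add(pos + 1);
            }
        }

        // Shuffle indexes into that array instead of the lines themselves.
        List<Integer> order = new ArrayList<Integer>();
        for (int i = 0; i < lineStarts.size(); i++) order.add(i);
        Collections.shuffle(order);
        // order.subList(0, 10000) now names the lines for the first output file;
        // each line can be read back via its offset in lineStarts.

        channel.close();
        raf.close();
    }
}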

Peter Lawrey
1

Use some sort of indexing scheme. Parse your CSV file once to get the number of rows (don't retain anything in memory, just parse over it) and choose 10,000 numbers from that range at random (make sure you avoid duplicates, for example with a Set<Integer> or something more sophisticated). Then parse over your CSV a second time, again maintaining a counter for the rows. If a row's number corresponds to one of your randomly chosen numbers, output it to one CSV file; output the rows with non-matching numbers to the other file.
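
For the duplicate-avoidance part, a small fragment of what the selection could look like (totalRows stands for whatever the first parse counted; the variable names are just placeholders):

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

int totalRows = 1000000;     // whatever the first parse over the CSV counted
int sampleSize = 10000;

Set<Integer> chosenRows = new HashSet<Integer>();
Random random = new Random();
while (chosenRows.size() < sampleSize) {
    chosenRows.add(random.nextInt(totalRows));  // duplicates are rejected by the Set, so we just loop
}
// On the second parse, write line i to the first file if chosenRows.contains(i),
// otherwise to the second file.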

G_H
  • This would still preserve the original ordering. – Nicola Musatti Oct 24 '11 at 12:23
  • @NicolaMusatti: Whoops, you're right. Forgot about that requirement. In that case, I'd say he's better off with some simple database table to temporarily keep the data in. Maybe JavaDB to keep things simple... – G_H Oct 24 '11 at 12:26
1
  1. First of all, count the number of lines in the input file by reading its contents (but not storing it in memory). Call the number of lines N.
  2. Take a random sample of size 10,000 from the numbers 1..N.
  3. Read the input file from the beginning. For each line, if the line number is in the sample drawn in step 2, write the line to file1; otherwise, write it to file2.

Step 2 can be accomplished while performing step 1 by using reservoir sampling.
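
A sketch of that combination: reservoir sampling draws 10,000 line numbers uniformly at random in a single pass, without knowing N up front (the path argument and method name are made up for the example):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Random;

public class ReservoirSample {
    // Returns k line numbers drawn uniformly at random from the file, in one pass.
    public static long[] sampleLineNumbers(String path, int k) throws IOException {
        long[] reservoir = new long[k];
        Random random = new Random();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            long lineNo = 0;
            while (reader.readLine() != null) {
                if (lineNo < k) {
                    reservoir[(int) lineNo] = lineNo;          // fill the reservoir first
                } else {
                    long j = (long) (random.nextDouble() * (lineNo + 1));
                    if (j < k) {
                        reservoir[(int) j] = lineNo;           // keep this line with probability k/(lineNo+1)
                    }
                }
                lineNo++;
            }
        } finally {
            reader.close();
        }
        return reservoir;   // the sampled line numbers, in no particular order
    }
}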

NPE
1

Here's one possible algorithm:

  1. Let MAX_LINES be the maximum number of lines in a manageable file;
  2. Read MAX_LINES lines from the input file, randomize them with your original algorithm and write them to a temporary file;
  3. Repeat 2. until there are no lines left in your input file;
  4. Let N be a random number between 0 and the number of temporary files you wrote; read the next line from the N-th temporary file;
  5. Repeat 4. until you have read all the lines from all the files; for the first 10,000 lines, write each one to the first output file, and write all the remaining lines to the other file (a sketch follows below).
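
A rough sketch of steps 1-5, under the assumption that it is acceptable to keep one reader open per temporary file (which sidesteps part of the re-reading cost discussed in the comments below); MAX_LINES and the file names are placeholders:

import java.io.*;
import java.util.*;

public class ChunkedShuffle {
    static final int MAX_LINES = 100000;    // lines that comfortably fit in memory

    public static void shuffle(File input, File first, File rest, int sampleSize)
            throws IOException {
        Random random = new Random();
        List<File> chunks = new ArrayList<File>();

        // Steps 1-3: read MAX_LINES lines at a time, shuffle in memory, write a temp file.
        BufferedReader reader = new BufferedReader(new FileReader(input));
        List<String> buffer = new ArrayList<String>();
        String line;
        while ((line = reader.readLine()) != null) {
            buffer.add(line);
            if (buffer.size() == MAX_LINES) {
                chunks.add(writeChunk(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) chunks.add(writeChunk(buffer));
        reader.close();

        // Steps 4-5: repeatedly pick a random chunk and take its next line.
        List<BufferedReader> readers = new ArrayList<BufferedReader>();
        for (File chunk : chunks) readers.add(new BufferedReader(new FileReader(chunk)));
        PrintWriter out1 = new PrintWriter(new FileWriter(first));
        PrintWriter out2 = new PrintWriter(new FileWriter(rest));
        long written = 0;
        while (!readers.isEmpty()) {
            int n = random.nextInt(readers.size());
            String next = readers.get(n).readLine();
            if (next == null) {                      // this chunk is exhausted
                readers.remove(n).close();
            } else {
                (written++ < sampleSize ? out1 : out2).println(next);
            }
        }
        out1.close();
        out2.close();
    }

    private static File writeChunk(List<String> lines) throws IOException {
        Collections.shuffle(lines);                  // the "original algorithm" from step 2
        File chunk = File.createTempFile("shuffle", ".csv");
        chunk.deleteOnExit();
        PrintWriter out = new PrintWriter(new FileWriter(chunk));
        for (String l : lines) out.println(l);
        out.close();
        return chunk;
    }
}

Keeping every chunk reader open trades file handles for the repeated re-reads, essentially the "open file pool" idea mentioned in the comments.
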
Nicola Musatti
  • I see a problem with step 4 if MAX_LINES << TOTAL_LINES. You can't hold all files open at once, and you will be reading a total of (approximately) MAX_LINES*TOTAL_LINES/2 lines (for every line that needs to be output, a file containing MAX_LINES has to be read, but on average only half of it). The >> factors into it in that the chance of reading the same file for the next N is very small. (If MAX_LINES = TOTAL_LINES the chance would be 100%.) – Stefan Oct 24 '11 at 15:37
  • p = MAX_LINES/TOTAL_LINES. The chance of reading the next line from the same file is p. Otherwise you have to read a new file and skip, on average, half its rows. ((1-p)*MAX_LINES/2 + p*1) is the effort per line processed => ((1-p)*MAX_LINES/2 + p*1) * TOTAL_LINES = total effort. For p << 1 this gives the formula above, which leads to a 'strange' result: it's minimal for MAX_LINES = 1. But creating a few million files may not be an option either. – Stefan Oct 24 '11 at 15:53
  • When all's said and done, it all depends on how random the output is required to be. One could just give a random order to the files, rather than multiplexing their contents, or implement some form of open file pool. On the other hand if the requirements are strict, an attempt should be made to generate temp files of roughly equal size. – Nicola Musatti Oct 24 '11 at 22:00
0

If you know the number of lines in your file and you're randomising complete rows, you can just randomise by line number and then read the selected rows. Pick random line numbers via the Random class and keep the numbers you've already chosen, so you don't pick one twice.

BufferedReader reader = new BufferedReader(new FileReader(new File("file.csv")));
BufferedWriter chosen = new BufferedWriter(new FileWriter(new File("chosen.csv")));
BufferedWriter notChosen = new BufferedWriter(new FileWriter(new File("notChosen.csv")));

int numChosenRows = 10000;
long numLines = 1000000000; 

Set<Long> chosenRows = new HashSet<Long>(numChosenRows + 1, 1);
for (int i = 0; i < numChosenRows; i++) {
    // nextLong is the helper from the answer linked below;
    // add returns false if the value already exists in the Set, so duplicates are retried
    while (!chosenRows.add(nextLong(numLines))) {
    }
}

String line;
for (long lineNo = 0; (line = reader.readLine()) != null; lineNo++) {
    if (chosenRows.contains(lineNo)) {
        // Do nothing for the moment
    } else {
        notChosen.write(line);
        notChosen.newLine();   // write() alone does not append a line separator
    }
}
reader.close();
notChosen.close();

// Randomise the set of chosen rows

// Use RandomAccessFile to write the rows in that order, then close the 'chosen' writer

See this answer for the nextLong method, which produces a random long scaled to a particular size.

Edit: Like most people, I overlooked the requirement to write the randomly selected lines in a random order. I'm presuming that RandomAccessFile would help with that: just randomise the list of chosen rows and access them in that order. As for the unchosen ones, I edited the code above to simply ignore the chosen ones.

ThomasH
  • RandomAccessFile and 'lines' don't go too well with each other. You would also have to build a mapping of rows to start + end offsets. – Stefan Oct 24 '11 at 15:57