I have a large CSV file of unknown size, possibly more than 4 GB. I need to read some rows from it at random to use as test cases in an application.

Reading the whole file into memory is not an option, because it would fail with an OutOfMemoryError.

One solution is to generate a list of random numbers in the range from 0 to the total number of rows, then sort the list. The file is then read line by line, and only the lines whose numbers appear in the list are kept. That way I get a random set of full rows from the CSV file.

Is there a library or method for reading full rows at random from a large CSV file?

One solution:

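import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// generate random line numbers in the range of the total line count
List<Integer> indexList = new ArrayList<>();
for (int i = 0; i < testCount; i++) {
    int random = faker.numberBetween(0, total);
    indexList.add(random);
}

// sort so the file can be read in a single forward pass
Collections.sort(indexList);

// read the selected lines from the file
List<String> list = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("test.csv"), StandardCharsets.UTF_8))) {

    String line;
    int lineNum = 0;
    int pos = 0;
    // stop as soon as every requested line has been collected
    while (pos < testCount && (line = reader.readLine()) != null) {

        // a duplicate index adds the same line more than once
        while (pos < testCount && indexList.get(pos) == lineNum) {
            list.add(line);
            pos++;
        }

        lineNum++;
    }
} // try-with-resources closes the reader even if reading fails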
niaomingjian
  • Please add what you have tried and the code you have written, so that it is easier to work out a solution. – Bahadir Tasdemir Feb 21 '17 at 08:57
  • Don't forget to start the JVM with a larger heap by using the -Xmx option (-Xms only sets the initial heap size) – Kainix Feb 21 '17 at 09:05
  • You can also generate a random number `p` between 0 (inclusive) and the size of the file. Then `seek` (e.g. using [skip()](https://docs.oracle.com/javase/7/docs/api/java/io/FileInputStream.html#skip%28long%29)) to the position `p` inside the file. From there, scan for the next EOL, then read and return the following line. – JimmyB Feb 21 '17 at 09:05
  • You could generate your array of randoms, create a BufferedReader and skip to each random number. Might be faster than reading line by line. – Jeremy Grand Feb 21 '17 at 09:06
  • Possible duplicate of [How to get a random line of a text file in Java?](http://stackoverflow.com/questions/2218005/how-to-get-a-random-line-of-a-text-file-in-java) – walen Feb 21 '17 at 09:14
  • @walen: With CSV you can't read lines, though, you have to parse them because the line break could be part of a field. – Joey Feb 21 '17 at 09:22
  • @niaomingjian Fair point, but that means you're talking about "rows" not "lines". I'm editing your question to make that clear. – walen Feb 21 '17 at 09:39
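
The seek-based idea from the comments above could look roughly like the sketch below. It is only an illustration under some assumptions: the class and method names are made up, the file is assumed to be UTF-8, and every line is assumed to be one row (no line breaks inside quoted fields). It uses `RandomAccessFile.seek` rather than `skip()`. Note that this approach is not uniform: a row that follows a long row is more likely to be picked than one that follows a short row.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical helper, not part of any library.
final class RandomSeekLine {

    /** Returns the line that follows a random byte offset, or null for an empty file. */
    static String randomLine(RandomAccessFile file, Random rng) throws IOException {
        long size = file.length();
        if (size == 0) {
            return null;
        }
        // random offset p in [0, size)
        long p = (long) (rng.nextDouble() * size);
        file.seek(p);
        file.readLine();              // discard the (probably partial) line at p
        String raw = file.readLine(); // the next complete line
        if (raw == null) {
            // p fell inside the last line: wrap around to the first line
            file.seek(0);
            raw = file.readLine();
        }
        // RandomAccessFile.readLine maps each byte to one char,
        // so re-decode the bytes as UTF-8 (assumed encoding)
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

Calling `randomLine` repeatedly on a `new RandomAccessFile("test.csv", "r")` yields one row per call, possibly with repeats. Because `RandomAccessFile.readLine` is unbuffered, this is only reasonable when a small number of rows is needed.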

1 Answer

Reservoir sampling is the algorithm that comes to mind here. The nice thing about it is that you don't need to know how many items there are, and you don't have to read the whole file into memory: you read one row at a time and keep only the rows sampled so far.
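
A minimal sketch of reservoir sampling over the lines of a file might look like this. The class name `ReservoirSampler` and its `sample` method are made up for illustration. The sketch treats every line as one row, so a header line is sampled like any other and rows containing line breaks inside quoted fields would need a real CSV parser instead of `readLine`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of any library.
final class ReservoirSampler {

    /** Returns up to k lines chosen uniformly at random, reading the source once. */
    static List<String> sample(Reader source, int k, Random rng) throws IOException {
        if (k <= 0) {
            return new ArrayList<>();
        }
        List<String> reservoir = new ArrayList<>(k);
        BufferedReader reader = new BufferedReader(source);
        String line;
        long seen = 0; // number of lines read so far
        while ((line = reader.readLine()) != null) {
            seen++;
            if (reservoir.size() < k) {
                // the first k lines fill the reservoir
                reservoir.add(line);
            } else {
                // pick j uniformly from [0, seen); the new line replaces
                // slot j when j < k, i.e. with probability k / seen
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, line);
                }
            }
        }
        return reservoir;
    }
}
```

For the file in the question this could be called as `ReservoirSampler.sample(new InputStreamReader(new FileInputStream("test.csv"), StandardCharsets.UTF_8), testCount, new Random())`. Unlike the index-list approach, it needs neither the total row count nor a sort, and it never returns the same row twice.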

Joey