
I have to write an application that compares some very big CSV files, each with around 40,000 records. I already have a version that works correctly, but it takes a long time to do the comparison, because the two files can be in a different order or contain different records; because of that I have to iterate (40,000^2)*2 times.

Here is my code:

if (nomFich.equals("CAR"))
{
    // For every line of the first file...
    while ((linea = br3.readLine()) != null)
    {
        array = linea.split(",");
        spliteado = array[0] + array[1] + array[2] + array[8];

        // ...re-open the second file and scan all of it looking for the same key.
        FileReader fh3 = new FileReader(cadena + lista2[0]);
        BufferedReader bh3 = new BufferedReader(fh3);

        find = 0;
        while ((linea2 = bh3.readLine()) != null)
        {
            array2 = linea2.split(",");
            spliteado2 = array2[0] + array2[1] + array2[2] + array2[8];

            if (spliteado.equals(spliteado2))
            {
                find = 1;
            }
        }

        // The key was not found in the second file: report the line as new.
        if (find == 0)
        {
            bw3.write("+++++++++++++++++++++++++++++++++++++++++++");
            bw3.newLine();
            // "The following CGIs have been added to the new list"
            bw3.write("Se han incorporado los siguientes CGI en la nueva lista");
            bw3.newLine();
            bw3.write(linea);
            bw3.newLine();
            aparece = 1;
        }
        bh3.close();
    }
}

I think that using a Set in Java is a good option, as the following post suggests: Comparing two csv files in Java

But before I try it that way, I would like to know if there are any better options.

Thanks in advance.

Distopic

3 Answers


As far as I can interpret your code, you need to find out which lines in the first CSV file do not have an equal line in the second CSV file. Correct?

If so, you only need to put all lines of the second CSV file into a HashSet. Like so (Java 7 code):

Set<String> linesToCompare = new HashSet<>();
try (BufferedReader reader = new BufferedReader(new FileReader(cadena + lista2[0]))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] splitted = line.split(",");
        linesToCompare.add(splitted[0] + splitted[1] + splitted[2] + splitted[8]);
    }
}

Afterwards you can simply iterate over the lines in the first CSV file and compare:

try (BufferedReader reader = new BufferedReader(new FileReader(...))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] splitted = line.split(",");
        String joined = splitted[0] + splitted[1] + splitted[2] + splitted[8];
        if (!linesToCompare.contains(joined)) {
            // handle missing line here
        }
    }
}

Does that fit your needs?

Seelenvirtuose
  • I actually need both things: the new lines in each CSV, and the changes between two lines with the same id. I will try the option you suggested. But I have a new problem: I can only use JRE 1.6, because this app will run on a server where I cannot change anything. – Distopic Mar 24 '14 at 09:25
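Regarding the JRE 1.6 restriction mentioned in the comment: the same HashSet approach works on Java 6 if you drop the try-with-resources and the diamond operator. Below is a rough, self-contained sketch under that assumption; the file names file1.csv and file2.csv and the System.out reporting are just placeholders for the question's actual paths and report writer.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class CsvDiffJava6 {
    public static void main(String[] args) throws IOException {
        // Build the set of keys (fields 0, 1, 2 and 8) from the second file.
        // Java 6: close the reader in a finally block instead of try-with-resources.
        Set<String> linesToCompare = new HashSet<String>();
        BufferedReader reader = new BufferedReader(new FileReader("file2.csv"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                linesToCompare.add(parts[0] + parts[1] + parts[2] + parts[8]);
            }
        } finally {
            reader.close();
        }

        // Scan the first file and report every line whose key is not in the set.
        reader = new BufferedReader(new FileReader("file1.csv"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                if (!linesToCompare.contains(parts[0] + parts[1] + parts[2] + parts[8])) {
                    System.out.println("Only in file1: " + line);
                }
            }
        } finally {
            reader.close();
        }
    }
}

Both passes are linear, so the whole comparison is roughly 2 * 40,000 set lookups instead of 40,000^2 line comparisons.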
HashMap<String, String> file1Map = new HashMap<String, String>();

// Index every line of the first file by its key (fields 0, 1, 2 and 8).
String line;
while ((line = file1.readLine()) != null) {
  String[] array = line.split(",");
  String key = array[0] + array[1] + array[2] + array[8];
  file1Map.put(key, key);
}

// Look up every line of the second file by the same key.
while ((line = file2.readLine()) != null) {
  String[] array = line.split(",");
  String key = array[0] + array[1] + array[2] + array[8];
  if (file1Map.containsKey(key)) {
    // file1 has the same line as file2
  }
  else {
    // file1 doesn't have a line like this one from file2
  }
}
xsiraul
    The line "if (file1Map.containsKey(key, key))" it must be if (file1Map.containsKey(key)) i supose. – Distopic Mar 24 '14 at 09:39
  • This solution does not work the way I need, because I also compare the other values of each record. It only finds the lines added according to the key, not the differences between two records. – Distopic Mar 24 '14 at 10:45
  • Deckard27, could you give me some more info on what you want to do, and a little example? I don't understand what you want to do, what your keys in a line are, etc. – xsiraul Mar 24 '14 at 10:53
  • For example, I have the line 214-007-03512-20025,214-007-03512-20574,47,513,-92,3,1,30 and I want to compare it with the line 214-007-03512-20025,214-007-03512-20574,47,513,-92,3,1,33. In this case the unique id of the record is 214-007-03512-20025,214-007-03512-20574, and I want to write to a txt file that 30 changed to 33 in the second line. – Distopic Mar 24 '14 at 11:49
  • Pseudo-code for what to do in your example: ... file1Map.put(uniqueRecordId, "30"); ... in the 2nd while: ... file1Map.get(uniqueRecordId).equalsIgnoreCase("33"); – xsiraul Mar 24 '14 at 12:20
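Fleshing out that pseudo-code into a minimal sketch, assuming (as in the example lines above) that the first two comma-separated fields form the unique id and the remaining fields are the values to compare; the file names and the System.out reporting are placeholders for the question's actual paths and txt report.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvValueDiff {
    public static void main(String[] args) throws IOException {
        // Map the unique id (first two fields) of every record in file1 to its full line.
        Map<String, String> file1Map = new HashMap<String, String>();
        BufferedReader reader = new BufferedReader(new FileReader("file1.csv"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                file1Map.put(fields[0] + "," + fields[1], line);
            }
        } finally {
            reader.close();
        }

        // Walk file2: an unknown id is a new record, a known id with a different
        // line means one or more of its values changed.
        reader = new BufferedReader(new FileReader("file2.csv"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                String id = fields[0] + "," + fields[1];
                String oldLine = file1Map.get(id);
                if (oldLine == null) {
                    System.out.println("New record: " + line);
                } else if (!oldLine.equals(line)) {
                    System.out.println("Changed: " + oldLine + " -> " + line);
                }
            }
        } finally {
            reader.close();
        }
    }
}

For the two example lines in the comment, this reports the old line ending in 30 against the new one ending in 33.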

Assuming this all won't fit in memory, I would first convert the files to stripped-down versions (el0, el1, el2, el8, original-file-line-number-for-reference-afterwards) and then sort those files. After that you can stream through both files simultaneously and compare the records as you go... With the sorting taken out of the equation, you only need to compare them "about once".

But I'm guessing you could do the same with a List/Array object that allows sorting and storing in memory; 40k records really doesn't sound like that much to me, unless the elements are very big, of course. And it's going to be orders of magnitude faster.
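For illustration, here is a minimal in-memory sketch of that sort-and-compare idea, assuming the stripped-down key is built from fields 0, 1, 2 and 8 as in the question and that keys are unique within each file; the file names are placeholders, and the original line numbers mentioned above are left out for brevity.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedCsvCompare {
    // Reduce every line to its stripped-down key (fields 0, 1, 2 and 8).
    static List<String> readKeys(String path) throws IOException {
        List<String> keys = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] f = line.split(",");
                keys.add(f[0] + "," + f[1] + "," + f[2] + "," + f[8]);
            }
        } finally {
            reader.close();
        }
        return keys;
    }

    public static void main(String[] args) throws IOException {
        List<String> a = readKeys("file1.csv");
        List<String> b = readKeys("file2.csv");
        Collections.sort(a);
        Collections.sort(b);

        // Merge-style walk over both sorted key lists: always advance the side
        // holding the smaller key and report keys that appear on only one side.
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                i++;
                j++;
            } else if (cmp < 0) {
                System.out.println("Only in file1: " + a.get(i++));
            } else {
                System.out.println("Only in file2: " + b.get(j++));
            }
        }
        while (i < a.size()) System.out.println("Only in file1: " + a.get(i++));
        while (j < b.size()) System.out.println("Only in file2: " + b.get(j++));
    }
}

The cost is dominated by the two sorts, so this does O(n log n) work instead of the 40,000^2 comparisons of the nested-loop approach.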

deroby