
I have two large data files (10M lines each). Each line contains a number of fields; the last 3 fields give the x, y, z position. To check my random generator, I want to be sure that there is not a single line in one file whose position is identical to that of any line in the second file. The only thing that occurred to me is something like

loop over file1
   read file1: eventnr1 energy1 posX1 posY1 posZ1
   loop over file2
      read file2: eventnr2 energy2 posX2 posY2 posZ2
      if ( fabs(posX1 - posX2) < 0.00001 && fabs(posY1 - posY2) < 0.00001 && fabs(posZ1 - posZ2) < 0.00001 )
         report a match

Of course, this is very time-consuming (I tried both a bash script and a C++ program; I am not sure which will be faster). Does anyone know of a smarter (faster) way?

To be clear, the files might be completely different except for one or two lines. Using UNIX "diff" would not work (the files are too large).

Best regards,

Machiel

  • Is it possible to read each file once into memory and run the check on the in-memory copy? – May 26 '20 at 11:58
  • btw, never getting the same values again isn't very "random" – 463035818_is_not_an_ai May 26 '20 at 12:00
  • If you can, try using a hash map: put every entry of the first file into the hash map, then check each entry of the second file against it. With no collisions in the hash map this gives an O(N) algorithm. – Fanto May 26 '20 at 12:28

2 Answers

  • Read the positions from both files into memory
  • Sort both lists
  • Walk the two sorted lists with one pointer each, always advancing the pointer at the smaller entry; whenever the two entries compare equal, you have found a position that occurs in both files. Stop when either pointer reaches its end.

This is O(N log N) (for the sorting; the rest is linear), compared to O(N*N) with your brute-force approach.
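
A rough C++ sketch of this sort-and-merge approach (the "eventnr energy x y z" line layout is taken from the question; the file names and parsing details are assumptions, so adjust them to your real format):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <tuple>
#include <vector>

struct Entry {
    double x, y, z;
    long line;   // original line number, kept for reporting, not used for sorting
};

// Read the position fields of each line; adjust if your format differs.
static std::vector<Entry> readPositions(const std::string& name) {
    std::vector<Entry> v;
    std::ifstream in(name);
    std::string text;
    long lineNo = 0;
    while (std::getline(in, text)) {
        ++lineNo;
        std::istringstream iss(text);
        long eventnr; double energy, x, y, z;
        if (iss >> eventnr >> energy >> x >> y >> z)
            v.push_back({x, y, z, lineNo});
    }
    return v;
}

int main() {
    auto a = readPositions("file1.dat");   // placeholder file names
    auto b = readPositions("file2.dat");

    auto less = [](const Entry& l, const Entry& r) {
        return std::tie(l.x, l.y, l.z) < std::tie(r.x, r.y, r.z);
    };
    std::sort(a.begin(), a.end(), less);
    std::sort(b.begin(), b.end(), less);

    // Two-pointer merge: advance the side with the smaller entry.
    // Comparison is exact here; for the 0.00001 tolerance from the question,
    // round the coordinates to that grid while reading.
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (less(a[i], b[j]))      ++i;
        else if (less(b[j], a[i])) ++j;
        else {
            std::cout << "match: file1 line " << a[i].line
                      << " and file2 line " << b[j].line << '\n';
            ++i; ++j;
        }
    }
}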

463035818_is_not_an_ai
  • I am not sure I understand. Sorting on the event number would not work because they are unrelated: in file 1, line 15000 might have event number 20000 (say), while in file 2 line 10000, with event number 21000 (or 18000), has posX, posY and posZ identical to that line in file 1. Sorting on posX might work, but these are doubles. Wouldn't that be a problem? – Machiel Kolstein May 26 '20 at 13:30
  • @MachielKolstein I meant only the `x`, `y` and `z`, and if you need to know which line, also include the line number (but don't use it for sorting). Why do you think sorting doubles would be a problem? – 463035818_is_not_an_ai May 26 '20 at 13:33
  • Oh okay. I have never used sorting before. In C++, that would be std::sort, right? I can only sort on X or Y or Z, but I guess that's fine. The only problem I see is that I first have to put both files into std::vectors, both with size 10M. Wouldn't that give memory problems? – Machiel Kolstein May 26 '20 at 13:44
  • @MachielKolstein 30 million doubles (or 60m for both files) is not unrealistic; it depends of course on how much memory you have available – 463035818_is_not_an_ai May 26 '20 at 13:48
  • Okay, thank you. I will try. It's worth trying because it would indeed speed things up quite a lot. I don't know if I have enough memory for it, but I will see. Thank you very much – Machiel Kolstein May 26 '20 at 13:54
  • I guess I would have to use a structure that stores the line number and positions X, Y and Z, and then sort a vector of that structure using a self-defined function that only uses position X. – Machiel Kolstein May 26 '20 at 14:12
  • @MachielKolstein yes, exactly. See eg here https://stackoverflow.com/questions/1380463/sorting-a-vector-of-custom-objects – 463035818_is_not_an_ai May 26 '20 at 14:13
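
A minimal illustration of the struct-plus-comparator idea from the last comments (the struct name and fields are placeholders):

#include <algorithm>
#include <vector>

struct Hit {
    long line;        // original line number, kept for reporting, not used for sorting
    double x, y, z;
};

int main() {
    std::vector<Hit> hits;   // filled while reading one of the files

    // Sort by position X only, via a self-defined comparison function.
    std::sort(hits.begin(), hits.end(),
              [](const Hit& a, const Hit& b) { return a.x < b.x; });
}

Sorting on the full (x, y, z) tuple instead, as in the sketch above, saves having to treat entries with equal x separately when comparing the two sorted lists.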

0) If you have enough RAM to keep the fields of the smaller file in RAM, you could do so.
0 a) Store them in a hash map (if you can afford its overhead and can use a hash function that maps numbers which are similar enough to count as equal to the same value) -> each check costs O(1)
0 b) Sort the fields in RAM (costs O(n * log n); each later check costs O(log n))

Iterate over the file that is not in RAM and check for each value whether you already have it in RAM.

This way you read both files only once and the costs are a) O(n), b) O(n log n).
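
A minimal sketch of option 0 a) with std::unordered_set; the rounding to a 0.00001 grid and the "eventnr energy x y z" line format are assumptions taken from the question, not part of this answer:

#include <cmath>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>

// Round each coordinate to the tolerance grid and build a text key, so that
// positions agreeing to within ~0.00001 end up with the same key.
static std::string key(double x, double y, double z) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%lld %lld %lld",
                  std::llround(x * 1e5), std::llround(y * 1e5), std::llround(z * 1e5));
    return buf;
}

int main() {
    std::unordered_set<std::string> seen;
    std::string line;
    long eventnr; double energy, x, y, z;

    // Keep the (smaller) first file in RAM.
    std::ifstream first("file1.dat");      // placeholder file names
    while (std::getline(first, line)) {
        std::istringstream iss(line);
        if (iss >> eventnr >> energy >> x >> y >> z)
            seen.insert(key(x, y, z));
    }

    // Stream the second file and check each position against the set.
    std::ifstream second("file2.dat");
    long lineNo = 0;
    while (std::getline(second, line)) {
        ++lineNo;
        std::istringstream iss(line);
        if (iss >> eventnr >> energy >> x >> y >> z)
            if (seen.count(key(x, y, z)))
                std::printf("file2 line %ld matches a position in file1\n", lineNo);
    }
}

Note that grid rounding can miss a pair that sits just on opposite sides of a grid boundary; checking the neighbouring grid cells as well would close that gap.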


1) If you cannot load the smaller file into RAM: perform the same steps as in 0) for each chunk of the smaller file. That means you read k chunks from the one file and, for each chunk, iterate through the whole other file.

This way you read the smaller file once and the other file k times. The costs are a) O(k * n), b) O(k * (n/k) * log(n/k) + k * n * log(n/k)).
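
And a sketch of the chunked variant 1 a); the chunk size, file names and line format are again assumptions:

#include <cmath>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>

// Same grid-rounded key as above, repeated here so the sketch is self-contained.
static std::string key(double x, double y, double z) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%lld %lld %lld",
                  std::llround(x * 1e5), std::llround(y * 1e5), std::llround(z * 1e5));
    return buf;
}

int main() {
    const std::size_t chunkLines = 1000000;   // tune to the available RAM; k = total lines / chunkLines
    std::ifstream small("file1.dat");         // placeholder file names
    std::string line;
    long eventnr; double energy, x, y, z;

    bool more = true;
    while (more) {
        // Build a hash set for the next chunk of the smaller file.
        std::unordered_set<std::string> chunk;
        while (chunk.size() < chunkLines &&
               (more = static_cast<bool>(std::getline(small, line)))) {
            std::istringstream iss(line);
            if (iss >> eventnr >> energy >> x >> y >> z)
                chunk.insert(key(x, y, z));
        }
        if (chunk.empty()) break;

        // One full pass over the other file per chunk (k passes in total).
        std::ifstream big("file2.dat");
        long lineNo = 0;
        while (std::getline(big, line)) {
            ++lineNo;
            std::istringstream iss(line);
            if (iss >> eventnr >> energy >> x >> y >> z)
                if (chunk.count(key(x, y, z)))
                    std::printf("file2 line %ld matches a position in file1\n", lineNo);
        }
    }
}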

MrSmith42