I need to compare two CSV files, each with around 500,000 to 900,000 lines (yes, they're big), and I'd like to know the best way to do this.
What I need to do:
- Delete rows from CSV1 that are not in CSV2, matching on a key value (the code column) (see the sketch after this list)
- Delete rows on both sides whose timestamps fall within certain given hours
- Show the difference in some fields, like Quantity, filtered by fields such as City or date
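Here is a rough sketch of what I had in mind for the key-based deletion (a minimal sketch; I'm assuming a ';' delimiter with CODE as the second column, as in the sample further down, and a header line to skip):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class CsvKeyFilter {

    // Extracts the CODE column (second field) from a semicolon-delimited line.
    private static String codeOf(String line) {
        return line.split(";")[1].trim();
    }

    public static List<String> rowsAlsoInCsv2(Path csv1, Path csv2) throws IOException {
        // Collect every code present in CSV2 into a hash set for O(1) lookups.
        Set<String> codesInCsv2;
        try (var lines = Files.lines(csv2)) {
            codesInCsv2 = lines.skip(1)            // skip header
                               .map(CsvKeyFilter::codeOf)
                               .collect(Collectors.toCollection(HashSet::new));
        }
        // Keep only the CSV1 rows whose code also appears in CSV2.
        try (var lines = Files.lines(csv1)) {
            return lines.skip(1)
                        .filter(l -> codesInCsv2.contains(codeOf(l)))
                        .collect(Collectors.toList());
        }
    }
}
```

A set of 900,000 short strings should fit comfortably in memory, which is why I thought this part at least could stay in plain Java.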
My idea is to load each CSV file into a Java list, create an SQLite database holding the final result (the differences and the deleted rows), and then run queries against that database: select only one city, certain dates/hours, or certain codes (or all of those at once; the end user will apply the filters from the interface via checkboxes and comboboxes).
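If I go the SQLite route, this is roughly what I have in mind (a minimal sketch assuming the Xerial sqlite-jdbc driver is on the classpath; the table name, file name, and sample rows are made up, and I'd normalize the datetime to 'YYYY-MM-DD hh:mm:ss' so string order matches time order):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

public class DiffStore {

    public static void main(String[] args) throws SQLException {
        // Hypothetical sample data; in the real app these come from the CSV diff.
        List<String[]> rows = List.of(
                new String[]{"city1", "city_1_code_1", "2024-01-05 10:30:00", "2500"},
                new String[]{"city2", "city_2_code_9", "2024-01-07 18:00:00", "1200"});

        // File-backed database; "jdbc:sqlite::memory:" would keep it purely in RAM.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:diff.db")) {
            conn.createStatement().execute("""
                    CREATE TABLE IF NOT EXISTS diff (
                        city     TEXT,
                        code     TEXT,
                        datetime TEXT,  -- 'YYYY-MM-DD hh:mm:ss' so it sorts correctly
                        quantity INTEGER)""");

            // Batching inside one transaction matters a lot for insert speed
            // at 500k-900k rows.
            conn.setAutoCommit(false);
            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO diff (city, code, datetime, quantity) VALUES (?, ?, ?, ?)")) {
                for (String[] r : rows) {
                    ins.setString(1, r[0]);
                    ins.setString(2, r[1]);
                    ins.setString(3, r[2]);
                    ins.setInt(4, Integer.parseInt(r[3]));
                    ins.addBatch();
                }
                ins.executeBatch();
            }
            conn.commit();

            // A filtered query like the ones the UI checkboxes/comboboxes would build.
            try (PreparedStatement q = conn.prepareStatement(
                    "SELECT code, quantity FROM diff WHERE city = ? AND datetime BETWEEN ? AND ?")) {
                q.setString(1, "city1");
                q.setString(2, "2024-01-01 00:00:00");
                q.setString(3, "2024-01-31 23:59:59");
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("code") + " -> " + rs.getInt("quantity"));
                    }
                }
            }
        }
    }
}
```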
Each CSV file looks something like this:
CITY; CODE; DATETIME; Quantity
city1; city_1_code_1; DD/MM/YYYY hh:mm:ss; 2500
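For what it's worth, I was planning to parse each line into a small record like this (the delimiter and date pattern are taken from the sample above):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public record CsvRow(String city, String code, LocalDateTime dateTime, int quantity) {

    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss");

    // Parses one semicolon-delimited line, e.g.
    // "city1; city_1_code_1; 05/01/2024 10:30:00; 2500"
    public static CsvRow parse(String line) {
        String[] f = line.split(";");
        return new CsvRow(f[0].trim(),
                          f[1].trim(),
                          LocalDateTime.parse(f[2].trim(), FMT),
                          Integer.parseInt(f[3].trim()));
    }
}
```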
I'm not sure which is the best way to do this performance-wise. Should I keep the data in memory and just use the lists for the comparisons? If not, is SQLite good enough for this, or should I use something different? Am I missing a better way to do this operation?
I'm developing this with JavaFX, and the results will be shown in a table (that's not a problem at all, just to give you some context).
Thanks in advance, and let me know if you need any more details.