
I need to compare two CSV files, each with around 500,000 to 900,000 lines (yes, they're big), and I'd like to know the best way to do this.

What I need to do

  • Delete rows from CSV1 that are not in CSV2, using a key value (code)
  • Delete rows on each side for some given hours
  • Show the difference in some fields like "Quantity", filtering by fields like "City" or "date"

I could try to store each CSV file in a Java list and create a database (using SQLite) with the final result (differences, and rows deleted), and then run queries against this database: select only one city, some dates/hours or codes, or even all of them at the same time (the final user will apply the filters from the interface using checkboxes or comboboxes).

Each CSV file looks something like this:

CITY;       CODE;          DATETIME;       Quantity
city1; city_1_code_1; DD/MM/YYYY hh:mm:ss;   2500
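
To give an idea, here is a minimal sketch of how I'm planning to parse one line into a Java object (the column order and the date pattern are assumptions based on the sample above):

    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;

    public class CsvRow {
        // Assumed date pattern, matching the DD/MM/YYYY hh:mm:ss sample above.
        private static final DateTimeFormatter FORMAT =
                DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss");

        final String city;
        final String code;
        final LocalDateTime dateTime;
        final int quantity;

        CsvRow(String city, String code, LocalDateTime dateTime, int quantity) {
            this.city = city;
            this.code = code;
            this.dateTime = dateTime;
            this.quantity = quantity;
        }

        // Parse one semicolon-separated line into a row object.
        static CsvRow parse(String line) {
            String[] f = line.split(";");
            return new CsvRow(f[0].trim(), f[1].trim(),
                    LocalDateTime.parse(f[2].trim(), FORMAT),
                    Integer.parseInt(f[3].trim()));
        }
    }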

I'm not sure which is the best way to do this performance-wise. Should I keep the data in memory and just use the lists to do the comparisons? If not, is SQLite good enough for this, or should I use something different? Am I missing a better way to do this operation?

I'm developing this with JavaFX, and the results should be shown in a table (that's not a problem at all, just to put you in context).

Thanks in advance, and let me know if you need to know anything

  • Load everything into SQLite3 and do your operations with SQL. – tonypdmtr Jun 16 '19 at 21:10
  • What is your deployment environment? Will you be constrained on memory? Why not do all this in memory? Why your focus on SQLite? – Basil Bourque Jun 17 '19 at 02:50
  • I won't be constrained on memory. The PC running this will have 16+ GB of RAM and an i5. I'm using SQLite because it's the DB I've worked with, and I've been doing well with it so far. I could work with Oracle, MySQL, Transact-SQL or Access, but I thought an embedded database would be a good solution. That's also why I've asked for a better solution (if any), because there could be one I don't know about which could do it much better. – Pedro Galindo Jun 17 '19 at 23:45

1 Answer


You'll never know for sure until you test for performance, but it seems like SQLite can handle a million rows easily. Some Stack Overflow users seem to be able to work on much larger data sets.
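
As a rough illustration (untested against your data, and the table/column names and file path are placeholders), bulk-loading one of the CSVs with batched inserts inside a single transaction is typically what keeps SQLite fast at this row count:

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class CsvToSqlite {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:compare.db");
                 BufferedReader in = Files.newBufferedReader(Path.of("csv1.csv"))) {

                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS csv1("
                             + "city TEXT, code TEXT, dt TEXT, quantity INTEGER)");
                }

                conn.setAutoCommit(false);                        // one big transaction
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO csv1(city, code, dt, quantity) VALUES (?,?,?,?)")) {
                    String line = in.readLine();                  // skip the header line
                    int count = 0;
                    while ((line = in.readLine()) != null) {
                        String[] f = line.split(";");
                        ps.setString(1, f[0].trim());
                        ps.setString(2, f[1].trim());
                        ps.setString(3, f[2].trim());
                        ps.setInt(4, Integer.parseInt(f[3].trim()));
                        ps.addBatch();
                        if (++count % 10_000 == 0) ps.executeBatch(); // flush every 10k rows
                    }
                    ps.executeBatch();
                }
                conn.commit();
            }
        }
    }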

From a maintainability perspective, using a database with proper indexing is the way to go if it's fast enough. If it's not fast enough for your needs, you may consider other, more complex approaches.
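
For example (again with assumed table and column names), the operations from the question map to fairly plain SQL once the join key is indexed: index the code column on both tables, delete the CSV1 rows with no match in CSV2, then join to compare quantities with whatever filter the user picks:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CompareQueries {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:compare.db");
                 Statement st = conn.createStatement()) {

                // Index the key column on both tables so the anti-join and the
                // comparison below don't degenerate into full scans.
                st.execute("CREATE INDEX IF NOT EXISTS idx_csv1_code ON csv1(code)");
                st.execute("CREATE INDEX IF NOT EXISTS idx_csv2_code ON csv2(code)");

                // 1) Delete rows of CSV1 whose code is not present in CSV2.
                st.executeUpdate(
                    "DELETE FROM csv1 WHERE code NOT IN (SELECT code FROM csv2)");

                // 2) Quantity difference per code, filtered by city as an example.
                ResultSet rs = st.executeQuery(
                    "SELECT a.city, a.code, a.quantity - b.quantity AS diff " +
                    "FROM csv1 a JOIN csv2 b ON a.code = b.code " +
                    "WHERE a.city = 'city1'");
                while (rs.next()) {
                    System.out.printf("%s %s %d%n",
                            rs.getString("city"), rs.getString("code"), rs.getInt("diff"));
                }
            }
        }
    }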

If you decide to use in-memory lists, you may consider using one of the high-performance collections libraries available in the Java ecosystem. I can't recommend any in particular, but you could take a look e.g. here to get an idea. Chances are, though, that unless you operate on the entire collection very often, the SQLite approach might still be faster (again, testing is key).
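
If it helps, here is a rough sketch of that in-memory route using only the standard collections; CsvRow stands for a hypothetical parsed-row class with code and quantity fields, not something from your code:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class InMemoryCompare {

        // Index CSV2 by code. If codes can repeat, a Map<String, List<CsvRow>>
        // would be needed instead of keeping the first occurrence.
        static Map<String, CsvRow> byCode(List<CsvRow> rows) {
            return rows.stream()
                       .collect(Collectors.toMap(r -> r.code, r -> r, (a, b) -> a));
        }

        // Keep only the CSV1 rows whose code also exists in CSV2.
        static List<CsvRow> keepMatching(List<CsvRow> csv1, Map<String, CsvRow> csv2ByCode) {
            return csv1.stream()
                       .filter(r -> csv2ByCode.containsKey(r.code))
                       .collect(Collectors.toList());
        }

        // Quantity difference between two matching rows.
        static int quantityDiff(CsvRow a, CsvRow b) {
            return a.quantity - b.quantity;
        }
    }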

Finally, a middle-of-the-road approach would be to use an in-memory database.
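
SQLite itself can serve as one: opening the connection in memory keeps everything in RAM while the JDBC/SQL code from above stays the same; a pure-Java alternative would be something like H2 with a jdbc:h2:mem: URL. A minimal sketch:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class InMemoryDb {
        public static void main(String[] args) throws Exception {
            // Open SQLite as a purely in-memory database; only the URL changes
            // compared to the file-based examples above.
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite::memory:")) {
                System.out.println("connected: " + !conn.isClosed());
            }
        }
    }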

Asotos