I have a large data set (~15 million records) that's updated monthly and delivered as CSV exports. I need to match and merge this data into a database containing roughly the same number of entities. The data source's business rules for what qualifies as a matching record differ from mine, so the source files may contain records that match each other according to my rules.
The problem I'm running into is that matching records commonly appear one line after another in the CSV files. For data that is new since the last import, a check against the database for a match finds nothing until the next commit, unless I commit after every line; by that point, all of the records in that chunk, including the ones that should have been merged with each other, have already been persisted as separate rows.
For example, if my commit interval is 50 and 30 matching lines are read between commits, those 30 records are persisted as 30 separate rows instead of one merged record.
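To make the commit-interval behaviour concrete, the import step is an ordinary chunk-oriented Spring Batch step, roughly like the sketch below (simplified, Spring Batch 5-style config; `SourceRecord` and the reader/writer beans are stand-ins for my real ones):

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ImportStepConfig {

    // SourceRecord and the csvReader/jpaWriter beans are placeholders for my real ones.
    @Bean
    public Step importStep(JobRepository jobRepository,
                           PlatformTransactionManager transactionManager,
                           ItemReader<SourceRecord> csvReader,
                           ItemWriter<SourceRecord> jpaWriter) {
        // Lines are read one at a time but only written/committed in chunks of 50,
        // so a match inside the same chunk isn't visible to a database lookup yet.
        return new StepBuilder("importStep", jobRepository)
                .<SourceRecord, SourceRecord>chunk(50, transactionManager)
                .reader(csvReader)
                .writer(jpaWriter)
                .build();
    }
}
```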
I have tried committing after every source record, but this drastically slows down the import for obvious reasons. Given the size of the data sets, even small hits to performance have a large impact on the overall run time.
The only thing I can think of is to keep track of the uncommitted records and match against them as well as against what's already in the database. I'm using Spring Data JPA (backed by Hibernate) for persistence and Spring Batch for the import, in case that makes a difference.
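Something along these lines is what I had in mind, if it helps clarify the idea (a rough sketch only, written against the Spring Batch 5 `ItemWriter` signature; `MyEntity`, `MyEntityRepository.findByBusinessKey`, `businessKey()`, `mergeWith()`, `mergeInto()`, and `toNewEntity()` are placeholders for my actual entity, repository, and matching/merge rules):

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.springframework.batch.item.Chunk;
import org.springframework.batch.item.ItemWriter;

// Collapse matches inside the still-uncommitted chunk first, then match the
// survivors against the database. All domain types and methods below are
// placeholders for my real types and matching rules.
public class MergingItemWriter implements ItemWriter<SourceRecord> {

    private final MyEntityRepository repository; // Spring Data JPA repository

    public MergingItemWriter(MyEntityRepository repository) {
        this.repository = repository;
    }

    @Override
    public void write(Chunk<? extends SourceRecord> chunk) {
        // 1. Merge records that match each other within this (uncommitted) chunk.
        Map<String, SourceRecord> byKey = new LinkedHashMap<>();
        for (SourceRecord row : chunk) {
            byKey.merge(row.businessKey(), row, SourceRecord::mergeWith);
        }

        // 2. Match each collapsed record against what's already in the database.
        for (SourceRecord row : byKey.values()) {
            MyEntity entity = repository.findByBusinessKey(row.businessKey())
                    .map(row::mergeInto)          // merge into the existing entity
                    .orElseGet(row::toNewEntity); // or create a new one
            repository.save(entity);
        }
    }
}
```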
I feel there's something I may be missing, and I'm sure I'm not the first to face this problem. Thanks in advance for any suggestions.