
I have a large data set (~15 million records) that's updated monthly and exported via CSV files. I need to match and merge this data into a database of approximately the same number of entities. The data source's business rules with regard to what qualifies as a matching record are not the same as mine; therefore, there may be matching records (according to my rules) within the source files.

The problem I'm running into is that matching records commonly appear one line after another in the CSV files. For any new data since the last import, unless I'm committing after every line, a check against the database for a match will return nothing until after the next commit, at which time all of the records in that commit (including the ones that should have matched other records) will be persisted.

For example, if my commit interval is 50 and 30 lines that match are read in between commits, those 30 records will be persisted instead of one merged record.

I have tried committing after every source record, but this drastically slows down the import for obvious reasons. Given the size of the data sets, even small hits to performance have a large impact on the overall run time.

The only thing I can think of to do is keep track of uncommitted records and match against them as well as what's in the database. I'm using Spring Data JPA for persistence (backed by Hibernate) and Spring Batch for import, if this makes a difference.
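
Roughly what I have in mind, as a sketch only (`SourceRecord`, `Entity`, `matchKey()`, `findByMatchKey()`, and `EntityRepository` are made-up stand-ins for my real types and query methods), is an `ItemProcessor` that keeps a per-chunk cache of the not-yet-committed entities and checks it before hitting the database:

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.ChunkListener;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.item.ItemProcessor;

// Sketch only: SourceRecord, Entity, matchKey(), and EntityRepository are
// placeholders for the real domain model and Spring Data repository.
public class MatchingItemProcessor
        implements ItemProcessor<SourceRecord, Entity>, ChunkListener {

    private final EntityRepository repository;

    // Entities read in the current chunk but not yet committed,
    // keyed by whatever identifies a "match" under my rules.
    private final Map<String, Entity> uncommitted = new HashMap<>();

    public MatchingItemProcessor(EntityRepository repository) {
        this.repository = repository;
    }

    @Override
    public Entity process(SourceRecord record) {
        String key = record.matchKey();

        // 1. Look for a match among records read earlier in this chunk.
        Entity match = uncommitted.get(key);

        // 2. Fall back to the database for anything already committed.
        if (match == null) {
            match = repository.findByMatchKey(key); // hypothetical query method
        }

        Entity result = (match != null)
                ? match.mergeFrom(record)   // merge into the existing entity
                : Entity.from(record);      // otherwise create a new one

        uncommitted.put(key, result);
        return result;
    }

    @Override
    public void beforeChunk(ChunkContext context) {
        uncommitted.clear();
    }

    @Override
    public void afterChunk(ChunkContext context) {
        uncommitted.clear();   // everything is persisted once the chunk commits
    }

    @Override
    public void afterChunkError(ChunkContext context) {
        uncommitted.clear();
    }
}
```

The same bean would be registered as both the step's processor and a chunk listener, so the cache is emptied at every commit boundary and never holds more than one chunk's worth of entities.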

I feel there's something I may be missing, and I'm sure I'm not the first to solve this problem. Thanks in advance for any suggestions.

Luke

2 Answers


What about using Java to load the CSV file into a staging table? Then fire off a stored procedure to move only the necessary rows from the staging table to the production table. When that's done, have the stored procedure clean up (truncate) the staging table.
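
For example, something along these lines (just a sketch; `staging_table`, its columns, and the `merge_staging()` procedure are placeholder names, and real CSV parsing would need a proper library):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: the staging table, its columns, and the merge_staging()
// procedure are placeholders for whatever the real schema provides.
public class StagingLoader {

    private static final int BATCH_SIZE = 1000;

    private final JdbcTemplate jdbcTemplate;

    public StagingLoader(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void load(String csvPath) throws Exception {
        List<Object[]> batch = new ArrayList<>();

        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");   // naive CSV parsing, for the sketch only
                batch.add(new Object[] { fields[0], fields[1], fields[2] });

                if (batch.size() == BATCH_SIZE) {
                    flush(batch);
                }
            }
        }
        flush(batch);

        // The procedure is assumed to move the matching rows to the production
        // table and truncate the staging table when it finishes.
        jdbcTemplate.execute("CALL merge_staging()");   // invocation syntax depends on the database
    }

    private void flush(List<Object[]> batch) {
        if (batch.isEmpty()) {
            return;
        }
        jdbcTemplate.batchUpdate(
                "INSERT INTO staging_table (col_a, col_b, col_c) VALUES (?, ?, ?)",
                batch);
        batch.clear();
    }
}
```

The batch insert keeps the load itself fast, and the procedure can apply the matching rules in one set-based pass instead of row by row.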

Striker
  • The same matching logic is used to import records via multiple means, including web service calls to the application. For something like this where the records come in one at a time, dumping the records into a staging table and then kicking off a stored procedure seems like overkill. I'd also like to keep my business logic out of the db, as it really should reside in the application as a matter of (IMO) proper design. If I were to use a staging table, I'd prefer to use the application to match the records; however, this would require more trips (or more involved queries) to the db. – Luke Jun 03 '14 at 16:59

You can use the SQL MERGE command if your database supports it. That way you delegate the problem of matching rows to the database engine.

MERGE INTO tablename USING table_reference ON (condition)
WHEN MATCHED THEN
    UPDATE SET column1 = value1 [, column2 = value2 ...]
WHEN NOT MATCHED THEN
    INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...]);
ali köksal
  • Unfortunately, Postgres doesn't support it - http://wiki.postgresql.org/wiki/SQL_MERGE. – Luke Jun 03 '14 at 17:24
  • Please take a look at [this post](http://stackoverflow.com/questions/17267417/how-do-i-do-an-upsert-merge-insert-on-duplicate-update-in-postgresql) – ali köksal Jun 04 '14 at 18:27
  • Thanks for the response; however, in this particular case, there are 38 attributes across the object graph that may be updated. Implementing the proposed solution would have me creating a function with 38 input params if I understand correctly. Additionally, I would be putting business logic in the database to control when the attributes were allowed to be updated - something I don't care to do. – Luke Jun 09 '14 at 17:15