1

We have a large set of data (bulk data) that needs to be checked if the record is existing in the database.

We are using SQL Server2012/JPA/Hibernate/Spring.

What would be an efficient or recommended way to check if a record exists in the database?

Our entity ProductCodes has the following fields:

private Integer productCodeId  // this is the PK
private Integer refCode1   // ref code 1-5 has a unique constraint
private Integer refCode2
private Integer refCode3
private Integer refCode4
private Integer refCode5
... other fields

The service that we are creating will be given a file where each line is a combination of refCode1-5.

The task of the service is to check and report all lines in the file that are already existing in the database.

We are looking at approaching this in two ways.

  1. Approach1: Usual approach.

    • Loop through each line and call the DAO to query the refCode1-5 if existing in the db.

    //psuedo code for each line in the file call dao. pass the refCode1-5 to query (select * from ProductCodes where refCode1=? and refCode2=? and refCode3=? and refCode4=? and refCode5=?

    • given a large list of lines to check, this might be inefficient since we will be invoking the DAO xxxx number of times. If the file say consists of 1000 lines to check, this will be 1000 connections to the DB
  2. Approach2: Query all records in the DB approach

    • We will query all records in the DB
    • Create a hash map with concatenated refCode1-5 as keys
    • Loop though each line in the file validating against the hashmap

    • We think this is more efficient in terms of DB connection since it will not create 1000 connections to the DB. However, if the DB table has for example 5000 records, then hibernate/jpa will create 5000 entities in memory and probably crash the application

We are thinking of going for the first approach since refCode1-5 has a unique constraint and will benefit from the implicit index.

But is there a better way of approaching this problem aside from the first approach?

mikes
  • 25
  • 1
  • 5
  • Yeah, one of the traditional ways is to bulk-load into a temp-table; almost all RDBMSs have a utility for this. SQL Server also has something called a [table-valued parameter](http://stackoverflow.com/a/337864), which looks like it might help in that regard too. – Clockwork-Muse May 01 '14 at 10:30

3 Answers3

0

try something like a batch select statement for say 100 refCodes instead of doing a single select for each refCode.

construct a query like

select <what ever you want> from <table> where ref_code in (.....)

Construct the select projection in a way that not just gives you wnat you want but also the details of ref_code. Teh in code you can do a count or multi-threaded scan of resultset if DB said you got less refCodes that the number you codes you entered in query.

Nazgul
  • 1,892
  • 1
  • 11
  • 15
  • thanks for your suggestion. I think ref_code in (...) will not work for my case. Sorry, I wasn't clear on the query part, I have updated the question. The query that we will be having will be `(select * from ProductCodes where refCode1=? and refCode2=? and refCode3=? and refCode4=? and refCode5=?)` – mikes May 01 '14 at 09:26
  • why can you do the same thing with (select * from ProductCodes where (refCode1=? and refCode2=? and refCode3=? and refCode4=? and refCode5=?) or (refCode1=? and refCode2=? and refCode3=? and refCode4=? and refCode5=?) and so on. You just need to check for rows with a specific combination of 5 ref_codes. – Nazgul May 01 '14 at 09:31
0

You can try to use the concat operator.

select <your cols> from <your table> where concat(refCode1, refCode2, refCode3, refCode4, refCode5) IN (<set of concatenation from your file>);

I think this will be quite efficient and it may be worth to try to see if pre-sorting the lines and playing with the num of concatenation taken each times bring you some benefits.

elbuild
  • 4,869
  • 4
  • 24
  • 31
0

I would suggest you create a temp table in your application where all records from file are stored initially with batch save, and later you run a query joining new temp table and productCodes table to achieve filtering how you like. In this way you are not locking productCodes table many times to check individual row as SqlServer locks rows on select statement as well.

Zahid M
  • 1,033
  • 10
  • 10