
I need to scrub an SQL Server table on a regular basis, but my solution is taking ridiculously long (about 12 minutes for 73,000 records).

My table has 4 fields:

id1
id2
val1
val2

For every group of records with the same "id1", I need to keep the first (lowest id2) and last (highest id2) and delete everything in between UNLESS val1 or val2 has changed from the previous (next lowest "id2") record.
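For example (made-up values), a group for id1 = 7 might look like this; the rows marked "delete" are the ones I want removed:

id1  id2  val1  val2
7    1    5     1.0    keep   (first id2 for this id1)
7    2    5     1.0    delete (no change from previous)
7    3    5     2.0    keep   (val2 changed from previous)
7    4    5     2.0    delete (no change from previous)
7    5    5     2.0    keep   (last id2 for this id1)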

If you're following me so far, what would a more efficient algorithm be? Here is my Java code:

boolean bDEL=false;
qps = conn.prepareStatement("SELECT id1, id2, val1, val2 from STATUS_DATA ORDER BY id1, id2");
qrs = qps.executeQuery();
//KEEP FIRST & LAST, DISCARD EVERYTHING ELSE *EXCEPT* WHERE CHANGE IN val1 or val2
while (qrs.next()) {
    thisID1 = qrs.getInt("id1");
    thisID2 = qrs.getInt("id2");
    thisVAL1= qrs.getInt("val1");
    thisVAL2= qrs.getDouble("val2");
    if (thisID1==lastID1) {     
        if (bDEL) {             //Ensures this is not the last record
            qps2 = conn2.prepareStatement("DELETE FROM STATUS_DATA where id1="+lastID1+" and id2="+lastID2);
            qps2.executeUpdate();
            qps2.close();
            bDEL = false;
        }
        if (thisVAL1==lastVAL1 && thisVAL2==lastVAL2) {
            bDEL = true;
        }
    } else if (bDEL) bDEL=false;
    lastID1 = thisID1;
    lastID2 = thisID2;
    lastVAL1= thisVAL1;
    lastVAL2= thisVAL2;
}

UPDATE 4/20/2014 @ 11:10 AM

OK, so here is my final solution: for every record to be deleted, the Java code appends an XML element to a string; every 10,000 records the string is written to a file, and Java calls a stored procedure on SQL Server, passing it the file name to read. The stored procedure can only use the file name as a variable if dynamic SQL is used to execute the OPENROWSET. I will play around with the batch interval, but so far my performance results are as follows:

BEFORE (1 record delete at a time):
73,000 records processed, 101 records per second

AFTER (bulk XML import):
1.4 Million records processed, 5800 records per second

JAVA SNIPPET:

String ts, sXML = "<DataRecords>\n";
boolean bDEL=false;
qps = conn.prepareStatement("SELECT id1, id2, val1, val2 from STATUS_DATA ORDER BY id1, id2");
qrs = qps.executeQuery();
//KEEP FIRST & LAST, DISCARD EVERYTHING ELSE *EXCEPT* WHERE CHANGE IN val1 or val2
while (qrs.next()) {
    thisID1 = qrs.getInt("id1");
    thisID2 = qrs.getInt("id2");
    thisVAL1= qrs.getInt("val1");
    thisVAL2= qrs.getDouble("val2");
    if (bDEL && thisID1==lastID1) {                             //Ensures this is not the first or last record
        sXML += "<nxtrec id1=\""+lastID1+"\" id2=\""+lastID2+"\"/>\n";
        if ((i + 1) % 10000 == 0) {                                 //Execute every 10000 records
            sXML += "</DataRecords>\n";                             //Close off Parent Tag
            ts = String.valueOf((new java.util.Date()).getTime());  //Each XML File Uniquely Named
            writeFile(sDir, "ds"+ts+".xml", sXML);                  //Write XML to file

            conn2 = dataSource.getConnection();
            cs = conn2.prepareCall("EXEC SCRUB_DATA ?");
            cs.setString(1, sDir + "ds"+ts+".xml");
            cs.executeUpdate();                                     //Execute Stored Procedure
            cs.close(); conn2.close();
            deleteFile(sDir, "ds"+ts+".xml");                       //Delete File

            sXML = "<DataRecords>\n";
        }
        bDEL = false;
    }
    if (thisID1==lastID1 && thisVAL1==lastVAL1 && thisVAL2==lastVAL2) {
        bDEL = true;
    } else if (bDEL) bDEL=false;
    lastID1 = thisID1;
    lastID2 = thisID2;
    lastVAL1= thisVAL1;
    lastVAL2= thisVAL2;
    i++;
}
qrs.close(); qps.close(); conn.close();

sXML += "</DataRecords>\n";
ts = String.valueOf((new java.util.Date()).getTime());
writeFile(sDir, "ds"+ts+".xml", sXML);

conn2 = dataSource.getConnection();
cs = conn2.prepareCall("EXEC SCRUB_DATA ?");
cs.setString(1, sDir + "ds"+ts+".xml");
cs.executeUpdate();
cs.close(); conn2.close();
deleteFile(sDir, "ds"+ts+".xml");

XML FILE OUTPUT:

<DataRecords>
<nxtrec id1="100" id2="1112"/>
<nxtrec id1="100" id2="1113"/>
<nxtrec id1="100" id2="1117"/>
<nxtrec id1="102" id2="1114"/>
...
<nxtrec id1="838" id2="1112"/>
</DataRecords>

SQL SERVER STORED PROCEDURE:

CREATE PROCEDURE [dbo].[SCRUB_DATA] @floc varchar(100)     -- File Location (dir + filename) as the only parameter
AS
BEGIN
        SET NOCOUNT ON;

        DECLARE @sql as varchar(max);
        SET @sql = '
                DECLARE @XmlFile XML

                SELECT @XmlFile = BulkColumn 
                FROM  OPENROWSET(BULK ''' + @floc + ''', SINGLE_BLOB) x;

                CREATE TABLE #TEMP_TABLE (id1 INT, id2 INT);

                INSERT INTO #TEMP_TABLE (id1, id2)  
                SELECT
                        id1 = DataTab.value(''@id1'', ''int''),
                        id2 = DataTab.value(''@id2'', ''int'')
                FROM
                        @XmlFile.nodes(''/DataRecords/nxtrec'') AS XTbl(DataTab);

                delete from D
                from STATUS_DATA D
                inner join #TEMP_TABLE T on ( (T.id1 = D.id1) and (T.id2 = D.id2) );    
        ';
    EXEC (@sql);    
END
3 Answers


It is almost certain that your performance problem is not in your algorithm, but in the implementation. Say, for example, your cleanup step has to remove 10,000 records; that means 10,000 round trips to your database server.

Instead of doing that, write each of the id pairs to be deleted to an XML file, and send that XML file to a SQL Server stored procedure that shreds the XML into a corresponding temp table or table variable. Then use a single DELETE ... FROM (or equivalent) to delete all 10K rows.

If you don't know how to shred XML in T-SQL, it is well worth the time to learn. Take a look at a simple example to get you started, or just check out a couple of search results for "tsql shred xml" to get going. A minimal sketch of the idea follows.
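(Sketch only, untested: here the XML is passed as a procedure parameter instead of a file, and the procedure/element names are made up purely for illustration.)

-- Shred an XML parameter into a temp table, then delete the whole batch in one statement
CREATE PROCEDURE dbo.SCRUB_DATA_XML @doc XML
AS
BEGIN
    SET NOCOUNT ON;

    CREATE TABLE #TO_DELETE (id1 INT, id2 INT);

    -- turn each <rec id1=".." id2=".."/> element into a row
    INSERT INTO #TO_DELETE (id1, id2)
    SELECT r.value('@id1', 'int'),
           r.value('@id2', 'int')
      FROM @doc.nodes('/recs/rec') AS x(r);

    -- one round trip deletes the whole batch
    DELETE D
      FROM STATUS_DATA D
      JOIN #TO_DELETE T ON T.id1 = D.id1 AND T.id2 = D.id2;
END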

ADDED

Pulling 10K records to the client should take well under a second, and your Java code likewise. If you don't have the time to learn to use XML as suggested, you could write a quick and dirty stored proc that accepts 10 (20, 50?) pairs of ids and deletes the corresponding records from within the stored proc (a rough sketch follows). I use the XML approach regularly to "batch" stuff from the client. If your batches are "large", you might take a look at the BULK INSERT command on SQL Server -- but XML is easy and a bit more flexible, as it can contain nested data structures, e.g. master/detail relationships.
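A rough sketch of that quick-and-dirty variant (hypothetical procedure name, only three pairs shown, untested; unused pairs are simply left NULL):

-- Delete up to three (id1, id2) pairs per round trip
CREATE PROCEDURE dbo.DELETE_STATUS_PAIRS
    @id1_a INT,        @id2_a INT,
    @id1_b INT = NULL, @id2_b INT = NULL,
    @id1_c INT = NULL, @id2_c INT = NULL
AS
BEGIN
    SET NOCOUNT ON;
    DELETE D
      FROM STATUS_DATA D
     WHERE (D.id1 = @id1_a AND D.id2 = @id2_a)
        OR (D.id1 = @id1_b AND D.id2 = @id2_b)   -- NULL pairs never match
        OR (D.id1 = @id1_c AND D.id2 = @id2_c);
END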

ADDED

I just did this locally

create table #tmp
(
  id int not null,
  primary key (id)
)
GO
insert #tmp (id)
  select 4
union
  select 5
GO

-- #tmp now has two rows

delete from L
from TaskList L
inner join #tmp T on (T.id = L.taskID)

(2 row(s) affected)

-- and they are no longer in TaskList

I.e., this should not be a problem unless you are doing it wrong somehow. Are you creating the temp table and then attempting to use it in different database connections/sessions? If the sessions are different, the temp table won't be seen in the second session.

Hard to think of another way for this to be wrong off the top of my head.

  • Thanks for your comment - I agree that most of the overhead is in the cleanup step with thousands of calls to delete. I will investigate the XML batch approach – Palladium Apr 14 '14 at 20:12
  • Thanks Gary, with your answer my code is 58 times faster. I wrote out my full code in the original post – Palladium Apr 20 '14 at 15:17

Have you considered doing something that pushes more of the calculation to SQL instead of Java?

This is ugly and doesn't take into account your "value changing" part, but it could be a lot faster:

(This deletes everything except the highest and lowest id2 for each id1)

select * into #temp 
FROM (SELECT ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2) AS 'RowNo', 
* from myTable)x 


delete i from myTable i
    left outer join
    (select t.* from #temp t
        left outer join (select id1, max(rowNo) rowNo from #temp group by id1) x 
        on x.id1 = t.id1 and x.rowNo = t.RowNo
    where t.RowNo != 1 and x.rowNo is null)z 
    on z.id2 = i.id2 and z.id1 = i.id1
where z.id1 is not null
  • I notice you conveniently skipped over the "value changing" part -- the part that happens to be ugly to write as set-based logic. In principle, using set-based logic is almost always the correct approach -- if you introduce a cursor it defeats the purpose. You could write a CLR proc to avoid pushing data around the network, but CLR procs are relatively speaking harder to write and debug than Java code on the client. It rapidly gets worse as you introduce more complex requirements for inter-row dependencies in set-based logic. – Gary Walker Apr 14 '14 at 22:18
  • @GaryWalker I'm going to say 'au contraire'. It's mostly a matter of what you're used to. Doing things set-based in SQL isn't necessarily more complex than doing it on a row-by-row basis, but I'll agree that it requires a bit more of experience/creativity to get it working (properly). – deroby Apr 15 '14 at 14:52

Never underestimate the power of SQL =)

Although I understand this seems more 'straightforward' to implement in a row-by-row fashion, doing it 'set-based' will make it fly.

Some code to create test-data:

SET NOCOUNT ON


IF OBJECT_ID('mySTATUS_DATA') IS NOT NULL DROP TABLE mySTATUS_DATA 
GO

CREATE TABLE mySTATUS_DATA (id1 int NOT NULL,
                            id2 int NOT NULL,
                            PRIMARY KEY (id1, id2),
                            val1 varchar(100) NOT NULL,
                            val2 varchar(100) NOT NULL)

GO
DECLARE @counter int,
        @id1     int,
        @id2     int,
        @val1    varchar(100),
        @val2    varchar(100)

SELECT @counter = 100000,
       @id1     = 1,
       @id2     = 1,
       @val1    = 'abc',
       @val2    = '123456'

BEGIN TRANSACTION

WHILE @counter > 0
    BEGIN
        INSERT mySTATUS_DATA (id1, id2, val1, val2)
                    VALUES (@id1, @id2, @val1, @val2)

        SELECT @counter = @counter - 1
        SELECT @id2 = @id2 + 1
        SELECT @id1 = @id1 + 1, @id2 = 1 WHERE Rand() > 0.8
        SELECT @val1 = SubString(convert(varchar(100), NewID()), 0, 9) WHERE Rand() > 0.90
        SELECT @val2 = SubString(convert(varchar(100), NewID()), 0, 9) WHERE Rand() > 0.90

        if @counter % 1000 = 0
            BEGIN
                COMMIT TRANSACTION
                BEGIN TRANSACTION
            END

    END

COMMIT TRANSACTION

SELECT top 1000 * FROM mySTATUS_DATA
SELECT COUNT(*) FROM mySTATUS_DATA

And here is the code that does the actual scrubbing. Mind that the why column is there merely for educational purposes; if you're going to put this into production I'd advise commenting it out, as it only slows down the operations. Also, you could combine the checks on val1 and val2 into one single UPDATE; in fact, with a bit of effort you could probably combine everything into one single DELETE statement (a rough sketch of that follows after the step-by-step code below). However, I very much doubt it would make things much faster, and it surely would make them a lot less readable. Anyway, when I run this on my laptop for 100k records it takes only about 5 seconds, so I doubt performance is going to be an issue.

IF OBJECT_ID('tempdb..#working') IS NOT NULL DROP TABLE #working
GO

-- create copy of table
SELECT id1, id2, id2_seqnr = ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2),
       val1, val2,
       keep_this_record = Convert(bit, 0),
       why = Convert(varchar(500), NULL)
  INTO #working
  FROM STATUS_DATA
 WHERE 1 = 2

-- load records
INSERT #working (id1, id2, id2_seqnr, val1, val2, keep_this_record, why)
SELECT id1, id2, id2_seqnr = ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2),
       val1, val2,
       keep_this_record = Convert(bit, 0),
       why = ''
  FROM STATUS_DATA

-- index
CREATE UNIQUE CLUSTERED INDEX uq0 ON #working (id1, id2_seqnr)

-- make sure we keep the first record of each id1
UPDATE upd
   SET keep_this_record = 1,
       why = upd.why + 'first id2 for id1 = ' + Convert(varchar, id1) + ','
  FROM #working upd
 WHERE id2_seqnr = 1 -- first in sequence

-- make sure we keep the last record of each id1
UPDATE upd
   SET keep_this_record = 1,
       why = upd.why + 'last id2 for id1 = ' + Convert(varchar, upd.id1) + ','
  FROM #working upd
  JOIN (SELECT id1, max_seqnr = MAX(id2_seqnr)
          FROM #working
         GROUP BY id1) mx
    ON upd.id1 = mx.id1
   AND upd.id2_seqnr = mx.max_seqnr

-- check if val1 has changed versus the previous record
UPDATE upd
   SET keep_this_record = 1,
       why = upd.why + 'val1 for ' + Convert(varchar, upd.id1) + '/' + Convert(varchar, upd.id2) + ' differs from val1 for ' + Convert(varchar, prev.id1) + '/' + Convert(varchar, prev.id2) + ','
  FROM #working upd
  JOIN #working prev
    ON prev.id1 = upd.id1
   AND prev.id2_seqnr = upd.id2_seqnr - 1
   AND prev.val1 <> upd.val1 

-- check if val2 has changed versus the previous record
UPDATE upd
   SET keep_this_record = 1,
       why = upd.why + 'val2 for ' + Convert(varchar, upd.id1) + '/' + Convert(varchar, upd.id2) + ' differs from val2 for ' + Convert(varchar, prev.id1) + '/' + Convert(varchar, prev.id2) + ','
  FROM #working upd
  JOIN #working prev
    ON prev.id1 = upd.id1
   AND prev.id2_seqnr = upd.id2_seqnr - 1
   AND prev.val2 <> upd.val2

-- delete those records we do not want to keep
DELETE del
  FROM STATUS_DATA del
  JOIN #working w
    ON w.id1 = del.id1
   AND w.id2 = del.id2
   AND w.keep_this_record = 0

-- some info
SELECT TOP 500 * FROM #working ORDER BY id1, id2
SELECT TOP 500 * FROM STATUS_DATA ORDER BY id1, id2
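
As said above, with a bit of effort the whole thing can probably be folded into one single DELETE. A rough, untested sketch using LAG (so SQL Server 2012 or later) that deletes directly through a CTE:

-- Keep the first and last id2 of every id1 group, plus any row whose val1 or val2
-- differs from the previous row; delete everything else in one statement.
WITH ordered AS
(
    SELECT id1, id2, val1, val2,
           rn_asc    = ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2),
           rn_desc   = ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY id2 DESC),
           prev_val1 = LAG(val1)    OVER (PARTITION BY id1 ORDER BY id2),
           prev_val2 = LAG(val2)    OVER (PARTITION BY id1 ORDER BY id2)
      FROM STATUS_DATA
)
DELETE FROM ordered
 WHERE rn_asc  > 1          -- not the first record of its id1 group
   AND rn_desc > 1          -- not the last record of its id1 group
   AND prev_val1 = val1     -- val1 unchanged versus the previous record
   AND prev_val2 = val2;    -- val2 unchanged versus the previous record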