
I have a REST service that takes XML with 400,000 records; each record contains the following fields: code, type, price.

In the DB (MySQL) I have a table named PriceData with 2,000,000 rows. The purpose of this REST service is: select all PriceData rows from the DB matching the code,type pairs from the XML, replace the price of each PriceData with the price from the XML, and if there is no PriceData with a given code,type, create a new one with the provided price.

Currently it works like this: select one PriceData from the DB according to the first record from the XML, set the new price or create a new PriceData, save it, and repeat these steps 400,000 times. (It takes about 5 minutes.)

I want to speed up this process.

First try:

Select 1,000 elements at a time from PriceData, and once all elements have been selected, update them:

Code:

private void updateAll(final List<XmlData> prices/*data from xml*/) {
        final List<PriceData> models = new ArrayList<>();
        int begin = 0;
        while (begin < prices.size()) {
            // cap the upper bound so the last partial chunk doesn't overflow
            final int end = Math.min(begin + 1000, prices.size());
            models.addAll(dao.findByPrices(prices.subList(begin, end)));
            begin = end;
        }

        final Map<String, XmlData> xmlData = prices.stream()
                .collect(Collectors.toMap(this::keyForPriceDate, e -> e));

        final Map<String, PriceData> modelMap = models.stream()
                .collect(Collectors.toMap(this::keyForRowModel, e -> e));
        final List<PriceData> modelsToSave = new ArrayList<>();

        for(final String key : xmlData.keySet()){
            final XmlData price = xmlData.get(key);
            PriceData model = modelMap.get(key);

            if(model == null){
                model = onEmptyPriceData(price);
            }
            model.setPrice(price.getPrice());
            modelsToSave.add(model);
        }
        modelService.saveAll(modelsToSave);
    }

I convert the two lists to maps to check whether a PriceData exists (the keys for xmlData and modelMap are created as code+type).
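For clarity, here is a minimal sketch of what the two key functions might look like (the real `keyForPriceDate`/`keyForRowModel` aren't shown in the question, so the body below is an assumption):

```java
public class PriceKeys {
    // Hypothetical reconstruction: both key functions must produce the same
    // key for the same (code, type) pair, or the map lookup silently fails.
    // A delimiter avoids collisions such as ("12","3") vs ("1","23").
    static String key(String code, String type) {
        return code + "|" + type;
    }

    public static void main(String[] args) {
        System.out.println(key("123", "qwe")); // prints "123|qwe"
    }
}
```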

The findByPrices method creates a query in the following format:

select * from PriceData where (code = 123 and type = 'qwe') or (...) -- and this OR repeats 1000 times

Now it takes 2 minutes.
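As a side note, MySQL also accepts row-constructor IN lists, which express the same lookup more compactly and work with placeholders. A sketch of building such a parameterized query (table and column names assumed from the snippets above):

```java
public class PriceQueryBuilder {
    // Builds: SELECT * FROM PriceData WHERE (productCode, type) IN ((?,?),(?,?),...)
    // The (?,?) pairs are then bound via PreparedStatement, avoiding the
    // string concatenation (and SQL injection risk) of the OR-chain version.
    static String buildLookup(int pairCount) {
        StringBuilder sql = new StringBuilder(
            "SELECT * FROM PriceData WHERE (productCode, type) IN (");
        for (int i = 0; i < pairCount; i++) {
            sql.append(i == 0 ? "(?,?)" : ",(?,?)");
        }
        return sql.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(buildLookup(3));
    }
}
```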

Second try:

Select all PriceData rows from the DB (2 million)

and use the algorithm above.

It takes 3 minutes. The first try is better, but in the future my REST service may receive 500,000 records, and I want to know which approach will be better in that scenario, or whether there is a better way to do this task.

My select method:

public List<PriceData> findBy(final List<XmlData> selectData) {
        final StringBuilder query = new StringBuilder("SELECT * from PriceData ");    
        query.append("WHERE \n");
        final Iterator<XmlData> selectDataIterator = selectData.iterator();
        while (selectDataIterator.hasNext()) {
            final XmlData data = selectDataIterator.next();
            query.append("( \n")
                 .append("productCode = " + data.getProductId() + " \n")
                 .append(" AND type = '" + data.getPriceind() + "' \n")
                 .append(" ) \n");
            if(selectDataIterator.hasNext()){
                query.append("OR \n");
            }
        }
        final SearchResult<PriceData> searchRes = search(query.toString());
        /*
         Here i use custom mapper that map list of result to my object
        */
        return searchRes.getResult();
    }
Almas Abdrazak
  • Which database dialect are you using? Oracle, Microsoft, MySQL, PostgreSQL, ...? – Andreas Jan 14 '18 at 04:31
  • @Andreas I use Mysql (add this info to question) – Almas Abdrazak Jan 14 '18 at 04:32
  • What database indexes do you have on the table? – stdunbar Jan 14 '18 at 05:03
  • Side note: your string concatenation is open to SQL Injection. Fix that, or be hacked (Andreas's [answer shows how to do it correctly](https://stackoverflow.com/a/48246782/812837)). Aside from that, the fastest way is usually to use [the bulk load utility](https://dev.mysql.com/doc/refman/5.5/en/load-xml.html) to get it into a table, then `MERGE` it. I dunno if that's callable, though. – Clockwork-Muse Jan 14 '18 at 07:39

1 Answer


You should use MySQL's `INSERT ... ON DUPLICATE KEY UPDATE` statement, combined with JDBC batch processing. This of course assumes that `code,type` is the primary key, or at least has a unique index.

private void updateAll(final List<XmlData> prices) throws SQLException {
    String sql = "INSERT INTO PriceData (code, type, price)" +
                " VALUES (?,?,?)" +
                " ON DUPLICATE KEY" +
                " UPDATE price = ?";
    try (PreparedStatement stmt = this.conn.prepareStatement(sql)) {
        int batchSize = 0;
        for (XmlData price : prices) {
            if (batchSize == 1000) { // flush batch every 1000
                stmt.executeBatch();
                batchSize = 0;
            }
            stmt.setInt       (1, price.getCode());
            stmt.setString    (2, price.getType());
            stmt.setBigDecimal(3, price.getPrice());
            stmt.setBigDecimal(4, price.getPrice());
            stmt.addBatch();
            batchSize++;
        }
        if (batchSize != 0)
            stmt.executeBatch();
    }
}

You can tune the batch size, but never flushing will use a lot of memory. I think 1,000 statements per batch is good, but I have no numbers backing that.
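Two further tweaks worth trying (a sketch; the URL, host, and database name below are placeholders, and this assumes the MySQL Connector/J driver):

```java
public class BatchConfig {
    // 1. rewriteBatchedStatements=true lets Connector/J collapse a batch into
    //    multi-row INSERT statements, greatly reducing round-trips.
    static final String URL =
        "jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true";

    // 2. VALUES(price) refers back to the value in the INSERT clause, so the
    //    price only needs to be bound once per row instead of twice.
    static final String SQL =
        "INSERT INTO PriceData (code, type, price) VALUES (?,?,?)" +
        " ON DUPLICATE KEY UPDATE price = VALUES(price)";

    public static void main(String[] args) {
        System.out.println(URL);
        System.out.println(SQL);
    }
}
```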

Andreas
  • I wonder why I got down-voted. The "[MERGE](https://en.wikipedia.org/wiki/Merge_(SQL))" statement is specifically for handling jobs like this, and JDBC batch processing is specifically for handling jobs like this, so why is this not a good answer? The only assumption made is that `code,type` is unique, which is a very fair assumption given the described operation. – Andreas Jan 14 '18 at 05:07
  • You got down voted because using that method is inefficient, and will not produce the performance gains we're after. – Jesse Ivy Jan 14 '18 at 05:13
  • The example is nice enough though. No complaints there. – Jesse Ivy Jan 14 '18 at 05:14
  • 1
    @JesseIvy Why is it inefficient? The main performance issue here is the number of round-trips to the database. Reducing that to one statement per record processed, instead of `SELECT` + `INSERT/UPDATE`, and batching those, is by far the best way to improve performance, of course assuming that `code,type` is a unique index / primary key, for fast row lookup. Please explain your opinion about it being inefficient. – Andreas Jan 14 '18 at 05:16
  • I've tried it and found no performance gain over executing the inserts individually. Isn't an INSERT... SELECT possible here? – Jesse Ivy Jan 14 '18 at 05:41
  • This is probably the optimal solution with the information provided in the question. There is only so much you can do to speed up 400k DML statements which by nature are slow (requiring locks and updating indexes etc). However, if the data would happen to only contain a fraction of _changed information_ -- new prices and products -- you might be able to reduce the amount of statements significantly by keeping an `{id, type}->{price}` cache in the app server. This assumes that the product catalog only gets updated through this service, so that cache updates can be kept simple. – Mick Mnemonic Jan 14 '18 at 10:32