1

I need to insert 60K rows into a Postgres DB in my Java/Spring application, using Hibernate/Spring Data.

The INSERT data that goes in is (1) USERS_T, (2) the associated new Users must also be in STUDY_PARTICIPANTS_T. Both of these are for 60K records each.

The below is working, but the performance is poor: 60K takes 2 minutes. Note that I'm filling out the Hibernate entity and then doing saveAll based on lists of size 1000.

        UsersT user = new UsersT();
        user.setUsername(study.getAbbreviation().toUpperCase()+subjectId);
        user.setRoleTypeId(new LookupT(150));
        user.setCreatedDate(new Date());
        //...
        List<StudyParticipantsT> participants = new ArrayList<StudyParticipantsT>();
        StudyParticipantsT sp = new StudyParticipantsT();
        sp.setStudyT(study);
        sp.setUsersT(user);
        sp.setSubjectId(subjectId);
        sp.setLocked("N");
        participants.add(sp);
        user.setStudyParticipantsTs(participants);

        // Add to Batch-Insert List; if list size ready for batch-insert, or if at the end of all subjectIds, do Batch-Insert saveAll() and clear the list
        batchInsertUsers.add(user);
        if (batchInsertUsers.size() == 1000 || i == subjectIds.size() - 1) {
            // Log this Batch-Insert
            if(log.isDebugEnabled()){
                log.debug("createParticipantsAccounts() Batch-Insert: Saving " + batchInsertUsers.size() + " records");
            }
            userDAO.saveAll(batchInsertUsers);
            // Reset list
            batchInsertUsers.clear();
        }          

I found a thread where someone was having the same problem, and the only solution they found is to compose a custom Native-SQL INSERT (..), (..), (..) string for each chunk of 1000, and run that manually, cutting out the ORM/Hibernate layer entirely: Need to insert 100000 rows in mysql using hibernate in under 5 seconds

But my INSERTs involve some joined tables. I could take the time to rewrite all these entity statements into a custom SQL myself, but it wouldn't be straightforward.

Are there any other solutions? I'm using - Spring 5.0.2 - Hibernate5.2.12

gene b.
  • 10,512
  • 21
  • 115
  • 227
  • You will have to call clear() on the entitymanager else the persisted entities stay in the first level cache as they are managed. This could already be enough of a performance gain. – Martin Frey Feb 20 '20 at 15:19
  • I don't know anything about Hibernate, but make sure that you are using prepared statements and run all inserts in a single transaction. – Laurenz Albe Feb 20 '20 at 15:24
  • 1
    If you want to make it fast, don't use Hibernate. –  Feb 20 '20 at 15:57
  • @Martin, I added `entityManager.clear();` right after `userDAO.saveAll(batchInsertUsers)` and I have the new injected variable `@Autowired private EntityManager entityManager;`. Now, it completed in 48 sec. but nothing was persisted in the DB. I'm guessing I shouldn't just do `entityManager.clear()` in the middle of this code, I should rewrite it to do Open/Session with `entityManager`? Right now it's just based on the Spring Data objects. – gene b. Feb 20 '20 at 16:05
  • 1
    Your bottleneck could be IO from index maintenance in PostgreSQL. Or it could be something else. Can you use OS monitoring tools to figure that out? "I could take the time to rewrite all these entity statements into a custom SQL myself, but it wouldn't be straightforward." Would it be easier to write such an SQL and test it manually to see how it performs, without integrating it into hibernate? – jjanes Feb 20 '20 at 16:34
  • @MartinFrey - I rewrote it with `entityManager.clear`/`entityManager.flush`. Unimpressive: The new metrics are 1m54sec instead of 2m02sec. – gene b. Feb 20 '20 at 16:38

1 Answers1

0

We improved performance by using SpringJDBC's jdbcTemplate.batchUpdate (no Hibernate) and reserving a Sequence range in advance for any foreign keys.

We didn't get down to the level of actual N repeated INSERT statements, which the other poster referenced above did; we're still using a framework approach (JDBCTemplate), but at least we don't use Hibernate/ORM anymore. This approach is fast, but it's not as super-fast as the N repeated INSERTs -- but it's acceptable now.

The actual SpringJDBC Batch-Insert occurs via jdbcTemplate.batchUpdate(sqlInsert, new BatchPreparedStatementSetter() {..}, and we actually split up the batches ourselves -- the BatchPreparedStatementSetter won't automatically split anything up for us, it will just submit that particular batch with a predetermined size.

/**
 * The following method performs a Native-SQL Batch-Insert of Participant accounts (using JdbcTemplate) to improve performance.
 * Each Participant account requires 2 INSERTs: (1) USERS_T, (2) STUDY_PARTICIPANTS_T (with an FK reference to USERS_T.ID).
 * Since there is no easy way to track the Sequence IDs between the two Batch-Inserts, we reserve the ID range for both tables, and 
 * then manually calculate our own IDs for USERS_T and STUDY_PARTICIPANTS_T ourselves.
 * Initially, domain objects are filled out; then they are added to the Batch List that we submit and clear ourselves.
 * (Originally the Batch-Insert was implemented with Hibernate/HQL, but due to slow performance it was nativized with jdbcTemplate.)
 * 
 * NOTE: The entire method is @Transactional and all data will be rolled back in case of any exceptions in this method (rollbackFor=Exception.class).
 * The updated Sequence values (set during reservation) will not be rolled back in this case, but Sequence gaps are normal. 
 */
@Override
@Transactional(readOnly = false, rollbackFor = Exception.class)
public void createParticipantsAccounts(long studyId, List<String> subjectIds) throws Exception {

    int maxInsertParticipantsBatchSize = 1000; // Batch size is 1000
    
    /*
      We need to insert into 2 tables, USERS_T and STUDY_PARTICIPANTS_T. 
      The table STUDY_PARTICIPANTS_T has an FK dependency on USERS_T.ID.
      Since there is no easy way to track the Sequence IDs between the two Batch-Inserts, we reserve the ID range for both tables, 
      and then manually calculate our own IDs for USERS_T and STUDY_PARTICIPANTS_T ourselves.
      The Sequences are immediately updated to the calculated final values to reserve the range. 
     */
    // 1. Obtain current Sequence values
    Integer currUsersTSeqVal = userDAO.getCurrentUsersTSeqVal();
    Integer currStudyParticipantsTSeqVal = studyParticipantsDAO.getCurrentStudyParticipantsTSeqVal();
    // 2. Immediately update the Sequences to the calculated final value (this reserves the ID range immediately)
    // In Postgres, updating the Sequences is: SELECT setval('users_t_id_seq', :val)
    userDAO.setCurrentUsersTSeqVal(currUsersTSeqVal + subjectIds.size());
    studyParticipantsDAO.setCurrentStudyParticipantsTSeqVal(currStudyParticipantsTSeqVal + subjectIds.size());                          
    
    // List for Batch-Inserts, maintained and submitted by ourselves in accordance with our batch size
    List<UsersT> batchInsertUsers = new ArrayList<UsersT>();        
    
    for(int i = 0; i < subjectIds.size(); i++) {
        
        String subjectId = subjectIds.get(i);           
        
        // Prepare domain object (UsersT with associated StudyParticipantsT) to be used in the Native-SQL jdbcTemplate batchUpdate
        UsersT user = new UsersT();
        user.setId(currUsersTSeqVal + 1 + i); // Set ID to calculated value
        user.setUsername(study.getAbbreviation().toUpperCase()+subjectId);
        user.setActiveFlag(true);
        // etc., fill out object, then subobject:
        List<StudyParticipantsT> participants = new ArrayList<StudyParticipantsT>();
        StudyParticipantsT sp = new StudyParticipantsT();
        sp.setId(currStudyParticipantsTSeqVal + 1 + i); // Set ID to caculated value
        // ...etc.
        user.setStudyParticipantsTs(participants);
        
        // Add to Batch-Insert List of Users
        batchInsertUsers.add(user);
        
        // If list size ready for Batch-Insert, or if at the end of all subjectIds, perform Batch Insert (both tables) and clear list
        if (batchInsertUsers.size() == maxInsertParticipantsBatchSize || i == subjectIds.size() - 1) {
            
            // Part 1: Insert batch into USERS_T
            nativeBatchInsertUsers(jdbcTemplate, batchInsertUsers);             
            // Part 2: Insert batch into STUDY_PARTICIPANTS_T
            nativeBatchInsertStudyParticipants(jdbcTemplate, batchInsertUsers);             
            // Reset list
            batchInsertUsers.clear();
        }
    }
}

The sub-methods for the actual Batch-Insert:

/**
 * Native-SQL Batch-Insert into USERS_T for Participant Upload.
 * NOTE: This method is part of its Parent's @Transactional. (Note also that we need "final" on the List param for Inner-Class access to this variable.)
 *  
 * @param jdbcTemplate
 * @param batchInsertUsers
 */
private void nativeBatchInsertUsers(JdbcTemplate jdbcTemplate, final List<UsersT> batchInsertUsers) {

    String sqlInsert =  "INSERT INTO PUBLIC.USERS_T (id, password, user_name, created_by, created_date, last_changed_by, last_changed_date, " + 
                                                    "first_name, last_name, organization, phone, lockout_date, lockout_counter, last_login_date, " + 
                                                    "password_last_changed_date, temporary_password, active_flag, uuid, " + 
                                                    "role_type_id, ws_account_researcher_id) " +
                        "VALUES (?, ?, ?, ?, ?, ?, ?, " +
                                "?, ?, ?, ?, ?, ?, ?, " + 
                                "?, ?, ?, ?, " + 
                                "?, ?" +
                                ") ";

    
    jdbcTemplate.batchUpdate(sqlInsert, new BatchPreparedStatementSetter() {

        @Override
        public int getBatchSize() {
            return batchInsertUsers.size();
        }

        @Override
        public void setValues(PreparedStatement ps, int i) throws SQLException {
            ps.setInt(1, batchInsertUsers.get(i).getId()); // ID (provided by ourselves)
            // etc., set PS for each i-th object

        }       
        
    });
    
}

/**
 * Native-SQL Batch-Insert into STUDY_PARTICIPANTS_T for Participant Upload.
 * NOTE: This method is part of its Parent's @Transactional. (Note also that we need "final" on the List param for Inner-Class access to this variable.)
 *  
 * @param jdbcTemplate
 * @param batchInsertUsers
 */ 
private void nativeBatchInsertStudyParticipants(JdbcTemplate jdbcTemplate, final List<UsersT> batchInsertUsers) {
    
    String sqlInsert =  "INSERT INTO PUBLIC.STUDY_PARTICIPANTS_T (id, study_id, subject_id, user_id, locked, " +                                                                     "created_by, created_date, last_changed_by, last_changed_date) " + 
                        "VALUES (?, ?, ?, ?, ?, " +
                                "?, ?, ?, ? " +
                                ") ";
            
    jdbcTemplate.batchUpdate(sqlInsert, new BatchPreparedStatementSetter() {

        @Override
        public int getBatchSize() {
            return batchInsertUsers.size();
        }   
        
        @Override
        public void setValues(PreparedStatement ps, int i) throws SQLException {            
            
            ps.setInt(1, batchInsertUsers.get(i).getStudyParticipantsTs().get(0).getId()); // ID (provided by ourselves)
            // etc. 
        }
        
    });
    
}       
gene b.
  • 10,512
  • 21
  • 115
  • 227