
I am trying to insert 100,000 rows into a MySQL table in under 5 seconds using Hibernate (JPA). I have tried every trick Hibernate offers and still cannot do better than 35 seconds.

1st optimisation: I started with the IDENTITY generator, which resulted in 60 seconds to insert. I then abandoned generated IDs and started assigning the @Id field myself, reading MAX(id) once and using AtomicInteger.incrementAndGet() for each new row. That reduced the insert time to 35 seconds.
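
For reference, a minimal sketch of that manual ID assignment (`IdAssigner` is a hypothetical helper; the table and column names follow the entity mapping shown further down):

import java.util.concurrent.atomic.AtomicInteger;
import javax.persistence.EntityManager;

public class IdAssigner {

    private final AtomicInteger nextId;

    // Seed the counter once from MAX(id); afterwards every ID is handed
    // out in memory, with no database round-trip per row.
    public IdAssigner(EntityManager em) {
        Number max = (Number) em
                .createNativeQuery("SELECT COALESCE(MAX(id), 0) FROM myEntity")
                .getSingleResult();
        this.nextId = new AtomicInteger(max.intValue());
    }

    public int next() {
        return nextId.incrementAndGet(); // thread-safe increment
    }
}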

2nd optimisation: I enabled batch inserts by adding

<prop key="hibernate.jdbc.batch_size">30</prop> <prop key="hibernate.order_inserts">true</prop> <prop key="hibernate.current_session_context_class">thread</prop> <prop key="hibernate.jdbc.batch_versioned_data">true</prop>

to the configuration. I was shocked to find that batch inserts did absolutely nothing to decrease insert time. It was still 35 seconds!

Now I am thinking about inserting using multiple threads. Does anyone have any pointers? Should I have chosen MongoDB?

Below is my configuration:

  1. Hibernate configuration:

<bean id="entityManagerFactoryBean" class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
        <property name="dataSource" ref="dataSource" />
        <property name="packagesToScan" value="com.progresssoft.manishkr" />
        <property name="jpaVendorAdapter">
            <bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter" />
        </property>
        <property name="jpaProperties">
            <props>
                <prop key="hibernate.hbm2ddl.auto">${hibernate.hbm2ddl.auto}</prop>
                <prop key="hibernate.dialect">${hibernate.dialect}</prop>
                <prop key="hibernate.show_sql">${hibernate.show_sql}</prop>
                <prop key="hibernate.format_sql">${hibernate.format_sql}</prop>
                <prop key="hibernate.jdbc.batch_size">30</prop>
                <prop key="hibernate.order_inserts">true</prop>
                <prop key="hibernate.current_session_context_class">thread</prop>
                <prop key="hibernate.jdbc.batch_versioned_data">true</prop>
            </props>
        </property>
    </bean>

    <bean class="org.springframework.jdbc.datasource.DriverManagerDataSource"
          id="dataSource">
        <property name="driverClassName" value="${database.driver}"></property>
        <property name="url" value="${database.url}"></property>
        <property name="username" value="${database.username}"></property>
        <property name="password" value="${database.password}"></property>
    </bean>

    <bean id="transactionManager" class="org.springframework.orm.jpa.JpaTransactionManager">
        <property name="entityManagerFactory" ref="entityManagerFactoryBean" />
    </bean>



    <tx:annotation-driven transaction-manager="transactionManager" />

  2. Entity configuration:

@Entity
@Table(name = "myEntity")
public class MyEntity {

    @Id
    private Integer id;

    @Column(name = "deal_id")
    private String dealId;

    ....
    ....

    @Temporal(TemporalType.TIMESTAMP)
    @Column(name = "timestamp")
    private Date timestamp;

    @Column(name = "amount")
    private BigDecimal amount;

    @OneToOne(cascade = CascadeType.ALL)
    @JoinColumn(name = "source_file")
    private MyFile sourceFile;

    public MyEntity(Integer id, String dealId, ....., Timestamp timestamp, BigDecimal amount, MyFile sourceFile) {
        this.id = id;
        this.dealId = dealId;
        ...
        ...
        ...
        this.amount = amount;
        this.sourceFile = sourceFile;
    }


    public String getDealId() {
        return dealId;
    }

    public void setDealId(String dealId) {
        this.dealId = dealId;
    }

   ...

   ...


    ....

    public BigDecimal getAmount() {
        return amount;
    }

    public void setAmount(BigDecimal amount) {
        this.amount = amount;
    }

    ....


    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

}

  3. Persisting code (service):

@Service
@Transactional
public class ServiceImpl implements MyService{

    @Autowired
    private MyDao dao;
....

    void foo(){
        for(MyEntity d : listOfObjects_100000){
            dao.persist(d);
        }
    }
}

  4. DAO class:

@Repository
public class DaoImpl implements MyDao{

    @PersistenceContext
    private EntityManager em;

    public void persist(MyEntity entity){
        em.persist(entity);
    }
}


Logs:

DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:32.906 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:32.906 [http-nio-8080-exec-2] 

... ...

DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:34.002 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:34.002 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:34.002 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:34.002 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:34.002 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:34.002 [http-nio-8080-exec-2] DEBUG o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement
18:26:34.002 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)
18:26:34.002 [http-nio-8080-exec-2] DEBUG o.h.e.j.batch.internal.BatchingBatch - Executing batch size: 27
18:26:34.011 [http-nio-8080-exec-2] DEBUG org.hibernate.SQL - update deal_source_file set invalid_rows=?, source_file=?, valid_rows=? where id=?
18:26:34.015 [http-nio-8080-exec-2] DEBUG o.h.e.j.batch.internal.BatchingBatch - Executing batch size: 1
18:26:34.018 [http-nio-8080-exec-2] DEBUG o.h.e.t.i.jdbc.JdbcTransaction - committed JDBC Connection
18:26:34.018 [http-nio-8080-exec-2] DEBUG o.h.e.t.i.jdbc.JdbcTransaction - re-enabling autocommit
18:26:34.032 [http-nio-8080-exec-2] DEBUG o.s.orm.jpa.JpaTransactionManager - Closing JPA EntityManager [org.hibernate.jpa.internal.EntityManagerImpl@2354fb09] after transaction
18:26:34.032 [http-nio-8080-exec-2] DEBUG o.s.o.jpa.EntityManagerFactoryUtils - Closing JPA EntityManager
18:26:34.032 [http-nio-8080-exec-2] DEBUG o.h.e.j.internal.JdbcCoordinatorImpl - HHH000420: Closing un-released batch
18:26:34.032 [http-nio-8080-exec-2] DEBUG o.h.e.j.i.LogicalConnectionImpl - Releasing JDBC connection
18:26:34.033 [http-nio-8080-exec-2] DEBUG o.h.e.j.i.LogicalConnectionImpl - Released JDBC connection


– Kumar Manish
  • What is your batch size? If most of your time is spent on I/O, you can try increasing the batch size. You can also consider multithreading to insert chunks of data; as there is no contention point, it should insert in parallel. – gati sahu May 29 '17 at 13:31
  • @gati sahu My configured batch size is 30, though I see this in the logs: `Executing batch size: 27`, and only once. Kindly see the logs, which I have updated. Another point: I have read that even with multithreading, the JDBC driver's inserts happen in a `synchronized()` method, hence actual inserts happen one by one? – Kumar Manish May 29 '17 at 13:39
  • @JacobBelanger How would the internet pose a problem? It is a localhost environment with local Tomcat & MySQL running! – Kumar Manish May 29 '17 at 13:40
  • Make sure you know where your bottleneck is. The insert statements should only be executed at commit time, so which part of the 35 seconds is spent doing database business and which part is just Hibernate overhead? How is the heap usage? Would `flush()`ing the em at some interval help? – piet.t May 29 '17 at 13:42
  • @piet.t From what I see, most of the time is spent doing this: `o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?)` – Kumar Manish May 29 '17 at 13:49
  • I am not sure the batch insert is even happening! If I look at the logs, it looks like the inserts are still happening one by one? Also, I tried increasing batch_size to 500/1000: a value of 500 gave me inserts in 27 seconds, a value of 1000 gave me 29 seconds. – Kumar Manish May 29 '17 at 13:56
  • Do you have many triggers or constraints in your DB? Those usually slow things down. Have you tested inserting directly into MySQL, away from your code, and measured the time? – Boschi May 29 '17 at 14:09
  • No triggers. Only one foreign key association mapped by: `@OneToOne(cascade = CascadeType.ALL) @JoinColumn(name = "source_file") private SourceFile sourceFile;` – Kumar Manish May 29 '17 at 14:13
  • I think inserts are happening one by one, if you look at the logs: `org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?) o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement org.hibernate.SQL - insert into deal (amount, deal_id, timestamp, from_currency, source_file, to_currency, id) values (?, ?, ?, ?, ?, ?, ?) o.h.e.j.b.internal.AbstractBatchImpl - Reusing batch statement ` – Kumar Manish May 29 '17 at 14:24
  • 1. Don't use `DriverManagerDataSource`; use a proper connection pool. 2. Don't use Spring for tx management and then mess around with `hibernate.current_session_context_class`, as that breaks proper integration. 3. Your for loop is kind of flawed: you should flush and clear the entity manager after every x records (preferably the same as your batch size setting). – M. Deinum May 29 '17 at 14:41
  • Ok, so I updated the datasource URL to include a profiler. Looks like batch updates are indeed happening: the single inserts in the logs are not what Hibernate sends to the MySQL driver. So batch inserts are happening, and now I might have to try multithreading. – Kumar Manish May 29 '17 at 14:42
  • For batch operations consider using straight JDBC instead of JPA. – Lew Bloch May 29 '17 at 16:49
  • Do you have some unique keys, `check` constraints, or something similar without proper indexes? Are you sure the issue is in the middleware, did you try running the equivalent script directly in the db? – Dragan Bozanovic May 29 '17 at 18:47

4 Answers


After trying all possible solutions, I finally found a way to insert 100,000 rows in under 5 seconds!

Things I tried:

1) Replaced Hibernate's/the database's AUTO_INCREMENT/generated IDs with self-generated IDs using AtomicInteger

2) Enabled batch inserts with batch_size=50

3) Flushed the cache after every batch_size number of persist() calls

4) Multithreading (did not attempt this one)

Finally, what worked was using a native multi-row insert query: inserting 1000 rows with one SQL INSERT statement instead of calling persist() on every entity. To insert 100,000 entities, I create a native query like this: "INSERT INTO MyTable VALUES (x,x,x),(x,x,x).......(x,x,x)" [1000 rows per insert statement].

Now it takes around 3 seconds to insert 100,000 records! So the bottleneck was the ORM itself! For bulk inserts, the only thing that seems to work is native insert queries!
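
A minimal JDBC sketch of that multi-row INSERT, using bind parameters instead of inlined literals (table and column names follow the logs above; only three of the columns are shown, and `MyEntity`'s getters are the ones from the question):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BulkInserter {

    private static final int CHUNK = 1000; // rows per INSERT statement

    public void insertAll(Connection con, List<MyEntity> rows) throws SQLException {
        for (int from = 0; from < rows.size(); from += CHUNK) {
            List<MyEntity> chunk = rows.subList(from, Math.min(from + CHUNK, rows.size()));

            // Build "INSERT INTO deal (...) VALUES (?, ?, ?), (?, ?, ?), ..."
            // with one placeholder group per row in this chunk.
            StringBuilder sql = new StringBuilder("INSERT INTO deal (id, deal_id, amount) VALUES ");
            for (int i = 0; i < chunk.size(); i++) {
                sql.append(i == 0 ? "(?, ?, ?)" : ", (?, ?, ?)");
            }

            try (PreparedStatement ps = con.prepareStatement(sql.toString())) {
                int p = 1;
                for (MyEntity e : chunk) {
                    ps.setInt(p++, e.getId());
                    ps.setString(p++, e.getDealId());
                    ps.setBigDecimal(p++, e.getAmount());
                }
                ps.executeUpdate(); // one statement inserts the whole chunk
            }
        }
    }
}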

– Kumar Manish
  • The **native multi-insert query** rewrite that you did yourself, the MySQL/MariaDB driver can do for you automatically if you set `rewriteBatchedStatements=true` for the connection. No need to rewrite the application! – Per Huss Aug 22 '17 at 13:45
  • I don't see it combined in my logs after adding the `rewriteBatchedStatements` property to my `persistence.xml`. This article is also not clear about where to take PGSimpleDataSource from: https://vladmihalcea.com/postgresql-multi-row-insert-rewritebatchedinserts-property/ – Zon Jan 09 '19 at 09:06
  • @PerHuss With your native query approach, what will happen if one or some of the inserts fail due to an integrity constraint violation? – Simrandeep Singh Apr 01 '19 at 18:51
  • How will you handle SQL injection here? – karthik selvaraj Feb 09 '22 at 03:14
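
For reference, the driver-side rewrite mentioned in the first comment is a standard MySQL Connector/J connection property. It could be enabled through the `database.url` property from the question's configuration (host and database name here are placeholders):

    database.url=jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true

With it set, the driver collapses a JDBC batch of single-row INSERTs into one multi-row INSERT. Note also that the sketch above uses bind parameters rather than concatenated literals, which sidesteps the SQL injection concern raised in the last comment.
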
  1. You are using Spring to manage the transaction but break it by using `thread` as the current session context. When using Spring to manage your transactions, don't mess around with the `hibernate.current_session_context_class` property. Remove it.

  2. Don't use `DriverManagerDataSource`; use a proper connection pool like HikariCP.

  3. In your for loop you should flush and clear the `EntityManager` at regular intervals, preferably every batch-size records. If you don't, a single persist takes longer and longer: each time you persist, Hibernate checks the first-level cache for dirty objects, and the more objects it holds, the more time that takes. Checking 10 or 100 objects is acceptable, but checking tens of thousands on every persist takes its toll.


@Service
@Transactional
public class ServiceImpl implements MyService {

    @Autowired
    private MyDao dao;

    @PersistenceContext
    private EntityManager em;

    void foo() {
        int count = 0;
        for (MyEntity d : listOfObjects_100000) {
            dao.persist(d);
            count++;
            // Flush and clear after every batch_size persists (30 here) so the
            // persistence context stays small and dirty checking stays cheap.
            if ((count % 30) == 0) {
                em.flush();
                em.clear();
            }
        }
    }
}

For a more in-depth explanation, see this blog and this blog.

– M. Deinum, Vlad Mihalcea
  • I did (1) & (3) and it had no effect. It still takes ~30 seconds for 100,000 inserts – Kumar Manish May 29 '17 at 17:27
  • Looks like this is as good as it gets? I will try multithreading as a last resort. – Kumar Manish May 29 '17 at 17:38
  • Multithreading will definitely help. A colleague of mine did that, splitting an ETL process into 10 subprocesses. The gain was around 7x. – Lluis Martinez Apr 10 '18 at 21:22
  • Multithreading **might** help; it might also make things worse. It depends on what kind of processing needs to be done, what kind of transactionality is used, etc. Never underestimate the power of a single thread. – M. Deinum Apr 11 '18 at 05:21
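
A minimal sketch of the multithreading idea discussed in these comments: partition the list and insert each chunk in its own transaction. `insertChunk` is a hypothetical service method (e.g. annotated `@Transactional`) that would apply the flush-and-clear loop above to one chunk:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelInserter {

    public void insertParallel(MyService service, List<MyEntity> rows, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunkSize = (rows.size() + threads - 1) / threads;

        // Each worker gets its own chunk; each chunk runs in its own
        // transaction and therefore its own persistence context.
        for (int from = 0; from < rows.size(); from += chunkSize) {
            List<MyEntity> chunk = rows.subList(from, Math.min(from + chunkSize, rows.size()));
            pool.submit(() -> service.insertChunk(chunk)); // insertChunk is hypothetical
        }

        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }
}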

Another option to consider is `StatelessSession`:

A command-oriented API for performing bulk operations against a database.

A stateless session does not implement a first-level cache nor interact with any second-level cache, nor does it implement transactional write-behind or automatic dirty checking, nor do operations cascade to associated instances. Collections are ignored by a stateless session. Operations performed via a stateless session bypass Hibernate's event model and interceptors. Stateless sessions are vulnerable to data aliasing effects, due to the lack of a first-level cache.

For certain kinds of transactions, a stateless session may perform slightly faster than a stateful session.

Related discussion: Using StatelessSession for Batch processing
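
A minimal sketch of what that could look like, assuming the `SessionFactory` can be obtained by unwrapping the JPA `EntityManagerFactory` (JPA 2.1's `unwrap`), which also speaks to the comment below about staying on the JPA side:

import java.util.List;
import javax.persistence.EntityManagerFactory;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

public class StatelessInserter {

    public void insertAll(EntityManagerFactory emf, List<MyEntity> rows) {
        SessionFactory sessionFactory = emf.unwrap(SessionFactory.class);
        StatelessSession session = sessionFactory.openStatelessSession();
        Transaction tx = session.beginTransaction();
        try {
            for (MyEntity row : rows) {
                session.insert(row); // executes immediately, no first-level cache
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}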

– Justinas Jakavonis
  • But I need to work with Hibernate's JPA APIs, hence with `EntityManager` and not `Session`. Is it possible with `EntityManager`s? Maybe I can try to disable the first-level cache of the `EntityManager`? – Kumar Manish May 30 '17 at 08:33

Uff. You can do a lot of things to increase speed.

1.) Use @DynamicInsert and @DynamicUpdate so that Hibernate omits null columns from its INSERTs and unchanged columns from its UPDATEs (see the sketch after this list).

2.) Try inserting the rows directly (without Hibernate) into your database to see if Hibernate is really your bottleneck.

3.) Use a SessionFactory and commit your transaction only every e.g. 100 inserts. Or open and close the transaction only once and flush your data every 100 inserts.

4.) Use the ID generation strategy "sequence" and let Hibernate preallocate the IDs (via the allocationSize parameter).

5.) Use caches.

Some of these possible solutions can have timing disadvantages when not used correctly. But you have a lot of options.
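
For point 1, a minimal sketch of those (Hibernate-specific) annotations applied to the entity from the question:

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Table;
import org.hibernate.annotations.DynamicInsert;
import org.hibernate.annotations.DynamicUpdate;

@Entity
@Table(name = "myEntity")
@DynamicInsert   // generated INSERT omits properties that are null
@DynamicUpdate   // generated UPDATE sets only the changed columns
public class MyEntity {

    @Id
    private Integer id;

    // ... remaining fields as in the question
}

As the first comment below notes, dynamic SQL generation has a cost of its own and can interfere with statement batching, so measure before adopting it.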

– M46
  • 1. can actually lead to worse performance, as Hibernate needs to create a single query per row and check whether a field changed... All of that takes time and might even destroy the possibility of batching statements, resulting in single statements instead of batched ones. – M. Deinum May 29 '17 at 14:52
  • Yes M. Deinum, it can, but it doesn't have to! If you have columns with BLOBs, those annotations give you a lot more performance because the BLOBs will not be updated when unchanged! As I wrote: "some options can have timing disadvantages". – M46 May 29 '17 at 19:50