Faster batch loading of nested data

Question

I am populating two tables, which have a 1-many relationship.

So I insert a line in outer, get the (autoincrement primary key) id for that line, and then insert 100 lines into inner (all with a foreign key pointing to outer.id).

Then I repeat, 50 times. For every entry in outer I have to insert, read id, and then insert into inner.

This is slow. Most of the time is spent in loading the 100 lines into inner. I suspect it would be much faster if I could insert all 50*100 lines into inner in one batch operation. But I cannot see how to do that - how can I make the foreign keys work?

How do other people make this efficient?

I am using Java / Spring. The 100 lines are inserted with a JdbcTemplate.batchUpdate().

public final void insert(final JdbcTemplate db,
                         final Iterable<DataBlock> data) {
    String insertSql = getInsertSql();
    String idQuery = getIdQuery();
    ItemRowMapper.IdRowMapper mapper = new ItemRowMapper.IdRowMapper();
    for (DataBlock block: data) {
        Object[] outer = block.getOuter();
        LOG.trace("Loading outer");
        db.update(insertSql, outer);
        LOG.trace("Getting index");
        // currently retrieve index based on natural key, but could use last index
        int id = db.query(idQuery, mapper, uniqueData(outer)).get(0);
        LOG.trace("Getting inner");
        List<Object[]> inner = block.getInner(id);
        // most time spent here
        LOG.trace(format("Loading inner (%d)", inner.size()));
        innerTable.insert(db, inner);
    }
}

And pseudo-SQL:

create table outer (
    integer id primary key autoincrement,
    ...
);

create table inner (
    integer outer references outer(id),
    ...
);

Update - The following appears to work with Spring 3.1.1 and Postgres 9.2-1003.jdbc4.

/**
 * An alternative implementation that should be faster, since it inserts
 * in just two batches (one for inner and one fo router).
 *
 * @param db A connection to the database.
 * @param data The data to insert.
 */
public final void insertBatchier(final JdbcTemplate db,
                                 final AllDataBlocks data) {
    final List<Object[]> outers = data.getOuter();
    List<Integer> ids = db.execute(
            new PreparedStatementCreator() {
                @Override
                public PreparedStatement createPreparedStatement(
                        final Connection con) throws SQLException {
                    return con.prepareStatement(getInsertSql(),
                            Statement.RETURN_GENERATED_KEYS);
                }
            },
            new PreparedStatementCallback<List<Integer>>() {
        @Override
        public List<Integer> doInPreparedStatement(final PreparedStatement ps)
                throws SQLException {
            for (Object[] outer: outers) {
                for (int i = 0; i < outer.length; ++i) {
                    setParameterValue(ps, i + 1,
                            SqlTypeValue.TYPE_UNKNOWN, outer[i]);
                }
                ps.addBatch();
            }
            ps.executeBatch();
            RowMapperResultSetExtractor<Integer> ids =
                    new RowMapperResultSetExtractor<Integer>(
                            new ItemRowMapper.IdRowMapper());
            try (ResultSet keys = ps.getGeneratedKeys()) {
                return ids.extractData(keys);
            }
        }
    });
    innerTable.insert(db, data.getInner(ids));
}

I don't understand. Why not insert all the records in `outer` in one batch, and then all the records in `inner` in one batch? Why don't you post some code? — dcaswell, Sep 09 '13 at 18:09
i could do that if i could retrieve the indices for all the inserts in the outer table. is there a way to retrieve the autoincrement key over a range of values? do i have to manage the key generation manually? more than one thread may be adding/deleting data to these tables. — andrew cooke, Sep 09 '13 at 18:16
Not up to snuff on java code, but from a database point of view you have to do this in a set based approach not by row by row which is what it sounds like you are doing currently. Can you dump all your data into a temp table in postgre and execute an insert using the temp table as a source? That will insert all the data in one swoop which is cheaper than many tiny inserts. — Kuberchaun, Sep 09 '13 at 18:49
@JustBob (thanks) ignoring the java, how would you set things up so that the inner (or temp) table has foreign key references to the outer table? that's my underlying problem. every 100 rows (more or less) of the inner table reference a different outer row. and the ids are autoincrement. i am thinking i need to manage the keys myself. — andrew cooke, Sep 09 '13 at 19:02

Pace · Accepted Answer · 2013-09-10T20:24:51.510

I'm not as familiar with JdbcTemplate, but assuming it is similar to JDBC I would do it with something similar (I would probably break this into multiple methods) to the following code:

private static final int BATCH_SIZE = 50;

public void addBatch(Connection connection, List<Outer> outers) {

  PreparedStatement outerInsertStatement = connection.prepareStatement("...", Statement.RETURN_GENERATED_KEYS);
  PreparedStatement innerInsertStatement = connection.prepareStatement("...", Statement.RETURN_GENERATED_KEYS);

  List<Integer> outerIds = new ArrayList<Integer>();

  for(Outer outer : outers) {
    outerInsertStatement.setParameter(...);
    ...
    outerInsertStatement.setParameter(...);

    outerInsertStatement.addBatch();
  }

  outerInsertStatement.executeBatch();
  //Note, this line requires JDBC3
  ResultSet primaryKeys = outerInsertStatement.getGeneratedKeys();
  while(!primaryKeys.isAfterLast()) {
    outerIds.add(primaryKeys.getInt(0));
  }

  for(int i = 0; i < outers.size(); i++) {
    Outer outer = outers.get(i);
    Integer outerId = outerIds.get(i);
    for(Inner inner : outer.getInners()) {
      //One of these setParameter calls would use outerId
      innerInsertStatement.setParameter(...);
      ...
      innerInsertStatement.setParameter(...);
      innerInsertStatement.addBatch();

      if( (i+1) % BATCH_SIZE == 0) {
        innerInsertStatement.executeBatch();
      }
    }
    innerInsertStatement.executeBatch();
  }
}

awesome. i had no idea getgeneratedkeys existed. thanks. (looking around it seems like it needs a very recent jdbc and postgres lib, but that shouldn't be an issue). — andrew cooke, Sep 10 '13 at 00:47
have you got this approach to work with Postgres, batch loading, and generated keys? i tried it (see my code in the "update" part of the question), but it seems postgres driver doesn't support this (it's optional in the jdbc standard - http://stackoverflow.com/questions/15684297/how-to-get-generated-keys-from-jdbc-batch-insert-in-oracle) — andrew cooke, Sep 10 '13 at 13:11
Ah, no, sorry. I tend to work in MySQL and haven't worked in Postgres. You may need to do `con.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)` when preparing your statement. — Pace, Sep 10 '13 at 15:48
i read that and forgot it. dumb. but you are even awesomer. that appears to work :o) — andrew cooke, Sep 10 '13 at 19:31

Faster batch loading of nested data

1 Answers1