2

I have tasked with reading 15+ million records from a SQL Server database, performing some processing on them, and writing the results to a flat file.

How can I do this efficiently using Java?

My intial thoughts are to query the data in chunks or to stream the results back for processing while the query is executing(if that is even possible).

SpeaksBinary
  • 185
  • 1
  • 17
  • Does this need to be Java? This sounds like exactly the kind of thing [SSIS](http://msdn.microsoft.com/en-us/library/ms141026.aspx) is intended for. – GarethD May 28 '13 at 13:53
  • Can you do the processing in the database? Much faster there, where it can take advantage of a multi-threaded environment. – Gordon Linoff May 28 '13 at 13:54
  • I've actually already implemented this in SSIS(it worked beatifully), but that solution was shot down due to office politics and infrastructure issues. I'm being asked to due it in Java because that's what everyone in my workplace is comfortable with. – SpeaksBinary May 28 '13 at 13:59
  • ask if they´re comfortable buying more hardware ;) – jambriz May 28 '13 at 14:05
  • What kind of processing required? – Stoleg May 28 '13 at 15:30
  • The processing required is trivial. Most of it is just mapping one value to another, data formatting, ect.. – SpeaksBinary May 28 '13 at 16:02

2 Answers2

3

It looks like the jdbc driver for sql-server respects the fetchsize hint, (which suggests how many rows to read at a time) so you should be able to issue one query and iterate through your resultset, processing and writing rows to the file as you go. E.g.:

public static void toFlat(Connection conn, File file, String destcode) {
    PreparedStatement ps = null;
    ResultSet rs = null;
    BufferedWriter out = null;
    try {
        ps = conn.prepareStatement(
            // col#:   1         2         3           4
            "SELECT threatid, lastname, firstname, flightnum " +
            "FROM travel.passengers " +
            "JOIN threats.aliases USING (firstname, lastname) " +
            "WHERE destination = ?" // param# 1
        );
        ps.setString(1,destcode); // param# 1

        out = new BufferedWriter(new FileWriter(file));


        // provides hint for driver to load x rows at a time:
        ps.setFetchSize(1000); 
        ps.executeQuery();
        rs = ps.getResultSet();

        while(rs.next()) {
            Integer threatid = rs.getInt(1);
            String lastname = rs.getString(2);
            String firstname = rs.getString(3);
            Integer flightnum = rs.getInt(4);

            //rubber meets road:
            String row = processRow(threatid, lastname, firstname, flightnum);
            out.write(row);
        }
    } catch(SQLException e) {
        // TODO
    } catch (IOException e) {
        // TODO
        e.printStackTrace();
    } finally {
        try {
            ps.close();
        } catch(Exception e){
            //TODO
        }
        try {
            rs.close();
        } catch(Exception e){
            //TODO
        }
        try {
            out.close();
        } catch(Exception e){
            //TODO
        }
    }
}
maybeWeCouldStealAVan
  • 15,492
  • 2
  • 23
  • 32
2

If efficiently is only related to reading, the answer regarding fetch size is the way to go. If you ran the Java program on the database host (=> localhost connection), it would give you a performance boost.

If efficiently applies to processing as well, do as much as possible in your SQL query. We have run measurements, a RDBMS outperforms Java. For example, filtering and sorting takes longer in Java. It's just makes no sense to re-implement database functionality in Java again, and it's slower.

If your algorithm is not easy to implement using a SQL query, do additional (procedural) processing in a stored procedure, write it completely as a stored procedure or use a stored function in your SQL query. Using stored functions with a SQL query is a really powerful and fast combination. Your Java client just reads the results, and writes them directly to disk. No buffering, no processing, just I/O.

If you were using Oracle, PostgreSql or DB2, you could even write the stored procedures/functions in Java.

Beryllium
  • 12,808
  • 10
  • 56
  • 86