
I have a large DB table which the user must be able to export as CSV.

Backend -> Query DB -> Transform to CSV -> download file over HTTP

For a small amount of data this is trivial. But the table will have millions of records in it, so I want to avoid loading all of it into memory, transforming all of it, and only then downloading it.

What I want is a way to stream the data: only touch ~1000 records at a time, flush them straight down the HTTP pipe to the browser for download, and free the memory right after.

This is what I have tried:

@PostMapping("/csv/{tableName}")
public void getFilteredExceptionsForCsv(
        @NotNull @PathVariable String tableName,
        @NotNull @RequestBody EnterpriseGetRowsRequest req,
        HttpServletResponse response) {

    response.setContentType("text/csv");
    String headerKey = "Content-Disposition";
    String headerValue = String.format("attachment; filename=\"%s\"", "filename.csv");
    response.setHeader(headerKey, headerValue);

    // simple select query
    String sql = gridRowsRepositoryCustom.getFilteredExceptionsForCsvQuery(req, tableName, getCurrentUser());

    try {
        Statement statement = null;
        statement = db.getDataSource().getConnection().createStatement();
        statement.setFetchSize(50);

        ResultSet rs = statement.executeQuery(sql);
        logger.debug("EXECUTED QUERY");
        ResultSetMetaData rsmd = rs.getMetaData();
        int columnsCount = rsmd.getColumnCount();

        while (rs.next()) {
            String[] values = new String[columnsCount];
            for (int i = 1; i <= columnsCount; i++) {
                String columnValue = rs.getString(i);
                values[i-1] = columnValue;
            }
            String line = Stream.of(values)
                    .map(this::escapeSpecialCharacters)
                    .collect(Collectors.joining(","));
            response.getWriter().println(line);
            response.getWriter().flush();
        }
        logger.debug("CONTENT WRITTEN");
        rs.close();
        statement.close();
        response.getWriter().close();
    } catch (SQLException | IOException e ) {
        logger.error("error executing sql statement {}", e.getMessage());
    }
}

My expectation was that the JDBC driver would read only 50 records at a time and write them directly into the response stream, then fetch the next 50 records, and so on - so the download would start as soon as the first 50 entries were flushed.

Instead it looks like all records are fetched first and then written to the output stream; only then does the request complete and the browser begin downloading the file.

Here's the log for loading 1m records:

12:03:16.542 DEBUG 27992 --- [nio-8181-exec-8] s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped to DataController#getFilteredExceptionsForCsv(String, RowsRequest, HttpServletResponse)
12:03:16.550 DEBUG 27992 --- [nio-8181-exec-8] d.t.z.d.repository.GridRowsRepository    : EXPORT SQL: SELECT TARGET.*, ...[truncated]
12:04:20.342 DEBUG 27992 --- [nio-8181-exec-8] d.t.zebra.api.controller.DataController  : EXECUTED QUERY
12:04:44.208 DEBUG 27992 --- [nio-8181-exec-8] d.t.zebra.api.controller.DataController  : CONTENT WRITTEN
12:04:44.208 DEBUG 27992 --- [nio-8181-exec-8] o.s.web.servlet.DispatcherServlet        : Completed 200 OK

As you can see, executing the query takes about a minute, transforming and writing to the output stream another 24 seconds - then the method finishes AND THEN the browser starts downloading the file.

I also tried the Spring version using ResponseEntity<StreamingResponseBody>, but the result was exactly the same.
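For reference, a minimal sketch of what that StreamingResponseBody variant looked like (not my exact code; it reuses the db data source and the escapeSpecialCharacters helper from above):

@PostMapping("/csv-streaming/{tableName}")
public ResponseEntity<StreamingResponseBody> getFilteredExceptionsForCsvStreaming(
        @NotNull @PathVariable String tableName,
        @NotNull @RequestBody EnterpriseGetRowsRequest req) {

    String sql = gridRowsRepositoryCustom.getFilteredExceptionsForCsvQuery(req, tableName, getCurrentUser());

    // Spring runs this callback on a separate thread and streams whatever is
    // written to `out` directly into the HTTP response body.
    StreamingResponseBody body = out -> {
        PrintWriter writer = new PrintWriter(out);
        try (Connection con = db.getDataSource().getConnection();
             Statement statement = con.createStatement()) {
            statement.setFetchSize(50);
            try (ResultSet rs = statement.executeQuery(sql)) {
                int columnsCount = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    for (int i = 1; i <= columnsCount; i++) {
                        if (i > 1) {
                            writer.print(',');
                        }
                        writer.print(escapeSpecialCharacters(rs.getString(i)));
                    }
                    writer.println();
                }
            }
        } catch (SQLException e) {
            throw new IOException(e);
        }
        writer.flush();
    };

    return ResponseEntity.ok()
            .header("Content-Disposition", "attachment; filename=\"filename.csv\"")
            .contentType(MediaType.valueOf("text/csv"))
            .body(body);
}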

So what am I doing wrong? Is what I described above even possible?

Any help appreciated.

Sebastian

2 Answers


For starters I suggest you start using the JdbcTemplate (which I assume is what your db property is). Your code is also (somewhat) flawed: in case of an error you might leak a connection, and you are never closing the JDBC connection you opened.

Using the JdbcTemplate will clean up your code; with a RowCallbackHandler you can then decide what to do with each row - in your case, write it to the output stream or writer.

@PostMapping("/csv/{tableName}")
public void getFilteredExceptionsForCsv(
        @NotNull @PathVariable String tableName,
        @NotNull @RequestBody EnterpriseGetRowsRequest req,
        HttpServletResponse response) throws IOException {

    response.setContentType("text/csv");
    response.setHeader("Content-Disposition", "attachment; filename=\"filename.csv\"");

    // simple select query
    String sql = gridRowsRepositoryCustom.getFilteredExceptionsForCsvQuery(req, tableName, getCurrentUser());

    // You could make this a method argument as well. A PrintWriter is used here
    // because its print methods don't throw checked exceptions inside the callback.
    PrintWriter out = response.getWriter();

    // Single-element holder so the callback can cache the column count
    // (local variables referenced from a lambda must be effectively final).
    int[] columnsCount = { -1 };

    db.query(sql, (ResultSet rs) -> {
        if (columnsCount[0] == -1) {
            columnsCount[0] = rs.getMetaData().getColumnCount();
        }
        for (int i = 1; i <= columnsCount[0]; i++) {
            if (i > 1) {
                out.print(',');
            }
            out.print(escapeSpecialCharacters(rs.getString(i)));
        }
        out.println();
        out.flush();
    });
}

NOTE: As mentioned, it is assumed that db is the JdbcTemplate; if not, inject it.

You might want to increase the fetch size: the lower the fetch size, the more round trips to the DB server are needed. If you really want to speed things up, don't do the CSV part in Java at all but have the query return a single result which already is the CSV record.
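For example (a sketch only - the column names are made up, and || is Oracle's string concatenation; any quoting/escaping then has to be done in SQL as well):

// Let the database produce one ready-made CSV record per row.
// col_a, col_b and col_c are hypothetical column names.
String sql = "SELECT col_a || ',' || col_b || ',' || col_c AS csv_line FROM " + tableName;

db.query(sql, (ResultSet rs) -> {
    out.println(rs.getString("csv_line"));
});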

If you want more control over what you receive, or over how the Statement is created, you can supply the statement creation yourself and request a forward-only result set (depending on whether your DB supports it or not).
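A sketch of that, assuming db is the JdbcTemplate - a PreparedStatementCreator lets you create the Statement yourself:

db.query(con -> {
    // Forward-only, read-only cursor with an explicit fetch size.
    PreparedStatement ps = con.prepareStatement(
            sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    ps.setFetchSize(1000);
    return ps;
}, (ResultSet rs) -> {
    // same row handling as in the example above
});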

M. Deinum
  • hi, thanks for the answer - you are right, it is a JdbcTemplate. I also tried the way you described; unfortunately it results in the same behavior (first everything loads, then the response is downloaded) – Sebastian Aug 25 '20 at 13:59
  • Could this be a driver issue? I am currently testing this on an Oracle 11g with ojdbc6 – Sebastian Aug 25 '20 at 14:07
  • It has nothing to do with the JDBC side, but everything with the servlet part I suspect. It needs the `Content-Length` to instruct the client what/how much to download. You can try setting the content length to -1 with `response.setContentLength(-1)` (although it might crash). If that doesn't solve it, or it leads to an exception, try adding `response.flushBuffer()` after the `out.flush()`. Or keep a counter and flush after x rows. – M. Deinum Aug 25 '20 at 17:35
  • The Content-Length is not needed for this and doesn't change anything. The server first finishes everything and then sends the response, including the matching Content-Length header – Sebastian Aug 27 '20 at 08:42
  • And thus the Content-Length matters. The Content-Length cannot be calculated unless everything is collected first, and you cannot send a header AFTER you have sent the content. So as soon as transmission has begun, you cannot send headers anymore. – M. Deinum Aug 27 '20 at 09:14

I (partly) managed to get this working using this answer: How To Stream Chunked Response With Spring Boot @RestController

The part where the browser begins downloading as soon as the first entries are written into the output stream now works - but the DB does not split the querying into batches; it executes one large query (the fetch size seems to be ignored).
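One note for anyone else hitting this: the fetch size never splits the query itself - the statement still executes once, and the fetch size only controls how many rows each round trip to the DB transfers (the Oracle driver defaults to 10 rows per fetch). When the query goes through a JdbcTemplate, the fetch size also has to be set on the template, because the template creates the statements internally:

// JdbcTemplate applies its fetchSize property to every statement it creates.
db.setFetchSize(1000);
db.query(sql, (ResultSet rs) -> {
    // row handling as before
});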

Because of that, the observed RAM consumption is still pretty high, and downloading the whole data set now takes about 5 minutes instead of ~1 min before (without streaming) - which also makes sense, kind of.

Since the streaming part was the original question and it's working now, I'll mark this as the answer.

Sebastian