
I'm a bit confused about how Go's database/sql package reads large datasets into memory. In this previous Stack Overflow question, How to set fetch size in golang?, there seem to be conflicting ideas about whether large datasets are batched on read or not.

I am writing a Go binary that connects to different remote DBs based on the input params given, fetches the results, and converts them to a CSV file. Suppose I have a query that returns a lot of rows, say 20 million. Loading all of this into memory at once would exhaust memory. Does the library batch the results automatically, loading the next batch into memory only when rows.Next() is called?

If the database/sql package does not handle it, are there options in the various driver packages?

https://github.com/golang/go/issues/13067 - From this issue and its discussion, I understand that the general idea is to have the driver packages handle this. As mentioned in the issue and also in this blog post https://oralytics.com/2019/06/17/importance-of-setting-fetched-rows-size-for-database-query-using-golang/, I found that Go's Oracle driver package has an option I can pass to set the fetch size for batching. But I am not able to find an equivalent in the other driver packages.
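
For reference, this is roughly what that Oracle driver option looks like, adapted from the blog post above (a sketch only; I have not run it myself, and the goracle package has since been renamed to godror):

package main

import (
	"database/sql"
	"log"

	"gopkg.in/goracle.v2" // Oracle driver used in the blog post (now renamed godror)
)

const resultSize = 10000

func main() {
	db, err := sql.Open("goracle", "user/password@host/service") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// FetchRowCount is the per-query option described in the blog post: it tells the
	// driver how many rows to fetch from the server per round trip, while rows.Next()
	// still hands back one row at a time.
	rows, err := db.Query("SELECT * FROM big_table", goracle.FetchRowCount(resultSize))
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		// scan and process each row here
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}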

To summarize -

  1. Does database/sql batch results on read automatically?

    • If yes, then my 2nd and 3rd questions do not matter.
  2. If not, are there options that I can pass to the various driver packages to set the batch size, and where can I find what those options are? I have already looked through the pgx docs and cannot find anything there that sets a batch size.

  3. Is there any other way to batch reads, e.g. a prepared statement with configuration specifying the batch size? (See the illustrative sketch right after this list.)
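
To make 2 and 3 concrete, this is the kind of per-query knob I am hoping the driver packages expose, analogous to JDBC's setFetchSize. The names below are made up purely for illustration:

// Hypothetical sketch: "somedriver" and "FetchSize" do not exist; this only
// illustrates the batch-size option I am looking for in the Go driver packages.
rows, err := db.Query(dataQuery, somedriver.FetchSize(10000))
if err != nil {
	return err
}
defer rows.Close()
for rows.Next() {
	// the driver would transparently fetch the next 10,000-row batch
	// from the server once the current batch is exhausted
}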

Some clarifications:

My question is: when a query returns a large dataset, is the entire dataset loaded into memory, or is it batched in some way, whether internally or by code that is called downstream from rows.Next?

From what I can see, there is a chunk reader that gets created with a default 8 KB buffer and is used to chunk the results. Are there cases where this does not happen? Or are the results from the DB always chunked?

Is the 8 KB buffer size that the chunk reader uses configurable in any way?

For more clarity, here is the existing Java implementation, which I am looking to rewrite in Go.


private static final int RESULT_SIZE = 10000;
private void generate() throws SQLException, IOException {
  // ... connection and other code ...

  Statement stmt = connection.createStatement(ResultSet.TYPE_FORWARD_ONLY,
      ResultSet.CONCUR_READ_ONLY);
  stmt.setFetchSize(RESULT_SIZE);
  ResultSet resultset = stmt.executeQuery(dataQuery);
  String fileInHome = getFullFileName(filePath, manager, parentDir);
  rsToCSV(resultset, new BufferedWriter(new FileWriter(fileInHome)));
}

private void rsToCSV(ResultSet rs, BufferedWriter os) throws SQLException {
  ResultSetMetaData metaData = rs.getMetaData();
  int columnCount = metaData.getColumnCount();
  try (PrintWriter pw = new PrintWriter(os)) {
    readHeaders(metaData, columnCount, pw);
    if (rs.next()) {
      readRow(rs, metaData, columnCount, pw);
      while (rs.next()) {
        pw.println();
        readRow(rs, metaData, columnCount, pw);
      }
    }
  }
}

The stmt.setFetchSize(RESULT_SIZE) call sets the number of rows the driver fetches from the database at a time; the result set is then processed row by row into a CSV.
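
For completeness, this is roughly the Go shape I have in mind for the rewrite, using only database/sql and encoding/csv (a sketch; the *sql.DB is assumed to be opened elsewhere with whichever driver the input params select). It streams row by row through rows.Next; the open question is only how many rows the driver pulls from the server per round trip:

package csvdump

import (
	"database/sql"
	"encoding/csv"
	"os"
)

// rowsToCSV streams a result set into a CSV file, one row at a time,
// mirroring the Java rsToCSV above.
func rowsToCSV(db *sql.DB, dataQuery, filePath string) error {
	rows, err := db.Query(dataQuery)
	if err != nil {
		return err
	}
	defer rows.Close()

	f, err := os.Create(filePath)
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)

	cols, err := rows.Columns()
	if err != nil {
		return err
	}
	// Header row, like readHeaders in the Java version.
	if err := w.Write(cols); err != nil {
		return err
	}

	// Scan every column as raw bytes so the same code works for arbitrary queries.
	vals := make([]sql.RawBytes, len(cols))
	scanArgs := make([]interface{}, len(cols))
	for i := range vals {
		scanArgs[i] = &vals[i]
	}

	record := make([]string, len(cols))
	for rows.Next() { // one row at a time; memory use depends on how the driver fetches
		if err := rows.Scan(scanArgs...); err != nil {
			return err
		}
		for i, v := range vals {
			record[i] = string(v)
		}
		if err := w.Write(record); err != nil {
			return err
		}
	}
	if err := rows.Err(); err != nil {
		return err
	}
	w.Flush()
	return w.Error()
}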

leoOrion
  • You can try this: https://pkg.go.dev/github.com/jmoiron/sqlx – Tiago Peczenyj Mar 27 '22 at 08:28
  • What exactly in that package would help me? I don't see any difference between sqlx and the sql package that would help me. – leoOrion Mar 27 '22 at 08:31
  • Sorry, I did not notice that you had already checked sqlx – Tiago Peczenyj Mar 27 '22 at 08:35
  • However, you can implement simple pagination with LIMIT and OFFSET and use sqlx to load directly into a slice of objects – Tiago Peczenyj Mar 27 '22 at 08:38
  • Can I do this without making changes to the query? – leoOrion Mar 27 '22 at 09:03
  • Without changing the query, you can try to adapt the existing code with some iterator and some generics/callbacks. – Tiago Peczenyj Mar 27 '22 at 09:27
  • However, pagination is just a technique to consume data in slices. The advantage is that you can stop/continue if needed. But if you need to read a lot of data you can just loop over each row and use buffered IO. It depends on your bottleneck. When you select from the database you start a stream, but you still need to decode each row (and it can fail). – Tiago Peczenyj Mar 27 '22 at 09:35
  • But I am afraid that what you need is driver-dependent. If your driver does not support such an option you can try to submit a pull request :) – Tiago Peczenyj Mar 27 '22 at 09:44
  • @leoOrion The `lib/pq` driver for PostgreSQL uses a server side cursor, so the `*sql.Rows` result to a query that selects 5 rows and a query that selects 5 million rows is of the same size in your Go program's memory. How you scan and batch those rows is entirely up to you. – mkopriva Mar 27 '22 at 10:20
  • @leoOrion The `go-sql-driver/mysql` driver for MySQL does not use cursors, as far as I know, but its `rows.Scan` implementation reads the results directly from the connection **in chunks**. So, again, a 5M `*sql.Rows` result by itself should not worry you, as far as memory is concerned. It's up to you to implement the scanning of those rows in such a way as to not eat up all of the host's resources. – mkopriva Mar 27 '22 at 10:29
  • Even if the Go program batched it, at the end you would write a 20-million-line CSV and hand it to the user. I think that's not a good idea. You should impose some pagination, IMO. – The Fool Mar 27 '22 at 11:29
  • There is no end user in my case. The CSV will get tarred and uploaded to S3. I don't need pagination, just a batched read. From what I understand, the database package reads in chunks automatically. I also went through the sql package code and profiled `rows.Next`. I can see that the data is read by something called a ChunkReader that has a default size. So my assumption is that the database package does this automatically and I don't need any specific options to be passed to the driver packages. CMIIW – leoOrion Mar 27 '22 at 13:59
  • Well, something or someone is the *user*. I am not talking about end users in the sense of your clients. I am saying that producing a 20-million-line file is always a bad idea. Some program will need to deal with it sooner or later, be it for reading in an editor or for further processing. It will be fun times. You just shifted your problem somewhere else, but you still have the same problem. – The Fool Mar 27 '22 at 15:33
  • `rows.Next` gets you the very next row. It's not for batching or anything like that. Another possible problem is if you batch and the records change while you are batching, or you have to open a transaction and leave it open for a prolonged time. All in all, this doesn't seem like a good idea. – The Fool Mar 27 '22 at 15:38
  • A batched read is the same thing as pagination when it comes to SQL. I'd recommend requesting from the DB in batches, in other words `LIMIT 5000` then `LIMIT 5000, OFFSET 5000` etc. This will be inefficient for 20 million rows though so you'll want to read up on how to do efficient pagination. – D-S Mar 28 '22 at 03:46
  • @TheFool I have no problem with processing large CSV files. That already happens in production and is not the issue at hand. This process currently runs in Java and is being rewritten in Go, so I was checking whether Go has support for batched reads. I know `rows.Next` reads only the next row. I have added some more clarification to my question. – leoOrion Mar 28 '22 at 05:48
  • This question is getting too large. I have added another question which is more precise about what I am looking for - https://stackoverflow.com/questions/71643058/does-db-sql-batch-large-datasets-on-reading-into-memory-is-it-configurable – leoOrion Mar 28 '22 at 06:11

0 Answers