
I have to execute an SQL query against Postgres using the code below. The query returns a huge number of rows (40M or more), each with 4 integer fields. On a workstation with 32 GB of RAM everything works, but on a 16 GB workstation the query is very slow (due to swapping, I guess). Is there any way to tell the C++ code to load rows in batches, without waiting for the entire result set? With Java I never had these issues, probably thanks to the better JDBC driver.

try {
        work W(*Conn);
        // exec() materializes the ENTIRE result set in memory before returning
        result r = W.exec(sql[sqlLoad]);
        W.commit();

        for (result::size_type rownum = 0; rownum < r.size(); ++rownum) {
            const result::tuple row = r[rownum];
            vid1 = row[0].as<int>();
            vid2 = row[1].as<int>();
            vid3 = row[2].as<int>();
            .....
        }
    } catch (const std::exception &e) {
        std::cerr << e.what() << std::endl;
    }

I am using PostgreSQL 9.3, and I see it has a single-row mode in libpq (http://www.postgresql.org/docs/9.3/static/libpq-single-row-mode.html), but I do not know how to use it from my C++ code. Your help will be appreciated.
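From the libpq docs, I gather single-row mode would look roughly like this when calling libpq directly from C++ (the conninfo string, table, and column names below are placeholders for my actual query); I would prefer a libpqxx-native way, though:

    #include <libpq-fe.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        PGconn *conn = PQconnectdb("host=localhost dbname=mydb user=me"); // placeholder conninfo
        if (PQstatus(conn) != CONNECTION_OK) {
            std::fprintf(stderr, "%s", PQerrorMessage(conn));
            return 1;
        }

        // Send the query asynchronously, then switch to single-row mode
        // BEFORE consuming any results.
        if (!PQsendQuery(conn, "SELECT vid1, vid2, vid3, vid4 FROM big_table") ||
            !PQsetSingleRowMode(conn)) {
            std::fprintf(stderr, "%s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        // PQgetResult() now yields one PGRES_SINGLE_TUPLE result per row,
        // then a final zero-row PGRES_TUPLES_OK result, then NULL.
        while (PGresult *res = PQgetResult(conn)) {
            if (PQresultStatus(res) == PGRES_SINGLE_TUPLE) {
                int vid1 = std::atoi(PQgetvalue(res, 0, 0));
                int vid2 = std::atoi(PQgetvalue(res, 0, 1));
                // ... process one row at a time ...
            }
            PQclear(res);
        }

        PQfinish(conn);
        return 0;
    }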

EDIT: This query runs only once, to create the necessary main-memory data structures. As such, it cannot be optimized. Also, pgAdminIII could easily fetch those rows in under one minute on the same (or even smaller-RAM) PCs, and Java could easily handle twice the number of rows (with Statement.setFetchSize(), http://docs.oracle.com/javase/7/docs/api/java/sql/Statement.html#setFetchSize%28int%29). So it is really an issue with the libpqxx library and not an application issue. Is there a way to get this functionality in C++, without explicitly setting limits / offsets manually?

Alexandros
    If I had to guess, I'm going to say you probably did `SELECT * FROM SOME_TABLE;` ..610mb in a single query.. is quite a lot. Well.. it's not that large of an allocation but still. I'm not sure your system can allocate that much contiguously? Why not get them in chunks? Select maybe 10M at a time 4 times. How is `vid` allocated? – Brandon Apr 28 '14 at 05:58
    libpqxx is a bit ... primitive ... compared to the capabilities of the underlying libpq. Sometimes, though, you'll find that a review of the libpq docs can be informative. If nothing else you can use SQL-level cursors - see the `DECLARE` and `FETCH` commands. – Craig Ringer Apr 28 '14 at 07:27

3 Answers


Use a cursor?

See also FETCH. The cursor will issue FETCH for you behind the scenes, I gather, but just in case, you can always code the streaming retrieval manually with FETCH.
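A minimal sketch of the manual variant, assuming an open `pqxx::connection conn` and placeholder cursor/table/column names. Plain SQL DECLARE/FETCH is issued through an ordinary libpqxx transaction, so only one batch is held in memory at a time:

    pqxx::work txn(conn);

    // A cursor without WITH HOLD only lives inside this transaction.
    txn.exec("DECLARE mycur NO SCROLL CURSOR FOR "
             "SELECT vid1, vid2, vid3, vid4 FROM big_table");

    for (;;) {
        // Pull the next batch of 10,000 rows; an empty result means we're done.
        pqxx::result r = txn.exec("FETCH 10000 FROM mycur");
        if (r.empty())
            break;
        for (const auto &row : r) {
            int vid1 = row[0].as<int>();
            // ... process the row ...
        }
    }

    txn.exec("CLOSE mycur");
    txn.commit();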

ArtemGr

To answer my own question, I adapted the code from How to use pqxx::stateless_cursor class from libpqxx?

try {
        work W(*Conn);
        pqxx::stateless_cursor<pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
                cursor(W, sql[sqlLoad], "mycursor", false);
        /* Assumes you know the total number of records returned */
        for (size_t idx = 0; idx < countRecords; idx += 100000) {
            /* Fetch 100,000 records at a time; retrieve() takes a
               half-open range, so this returns rows [idx, idx + 100000) */
            result r = cursor.retrieve(idx, idx + 100000);
            for (result::size_type rownum = 0; rownum < r.size(); ++rownum) {
                const result::tuple row = r[rownum];
                vid1 = row[0].as<int>();
                vid2 = row[1].as<int>();
                vid3 = row[2].as<int>();
                .............
            }
        }
    } catch (const std::exception &e) {
        std::cerr << e.what() << std::endl;
    }
Alexandros

Cursors are a good place to start. Here's another cursor example, using a do-while loop:

    const std::string connStr("user=" + opt::dbUser + " password=" + opt::dbPasswd +
                              " host=" + opt::dbHost + " dbname=" + opt::dbName);

    pqxx::connection conn(connStr);
    pqxx::work txn(conn);
    std::string selectString = "SELECT id, name FROM table_name WHERE condition";

    pqxx::stateless_cursor<pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
            cursor(txn, selectString, "myCursor", false);

    // cursor variables
    size_t idx = 0;       // starting location
    size_t step = 10000;  // number of rows for each chunk
    pqxx::result result;
    do {
        // get the next cursor chunk and update the index
        result = cursor.retrieve(idx, idx + step);
        idx += step;

        size_t records = result.size();
        std::cout << idx << ": records pulled = " << records << std::endl;

        for (const auto &row : result) {
            // iterate over cursor rows
        }
    } while (result.size() == step); // if result.size() != step, we're on our last chunk
    std::cout << "Done!" << std::endl;

I'm iterating over approximately 33 million rows in my application. In addition to using a cursor, I used the following approach (a sketch follows the list):

  1. Split the data into smaller chunks. For me, that was using bounding boxes to grab data in a given area.
  2. Construct a query to grab each chunk, and use a cursor to iterate over it.
  3. Store the chunks on the heap and free them once you're done processing the data from a given chunk.
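A minimal sketch of that chunking pattern, assuming an open `pqxx::connection conn`, hypothetical `x`/`y` columns, and a `computeBoundingBoxes()` helper (all names invented for illustration; adapt to your own schema):

    struct BBox { double x0, y0, x1, y1; };
    std::vector<BBox> boxes = computeBoundingBoxes(); // hypothetical helper

    for (const BBox &b : boxes) {
        pqxx::work txn(conn);
        // Build a query restricted to one chunk.
        std::string sql =
            "SELECT id, name FROM table_name"
            " WHERE x >= " + std::to_string(b.x0) + " AND x < " + std::to_string(b.x1) +
            " AND y >= " + std::to_string(b.y0) + " AND y < " + std::to_string(b.y1);

        pqxx::stateless_cursor<pqxx::cursor_base::read_only, pqxx::cursor_base::owned>
                cursor(txn, sql, "chunkCursor", false);

        // Iterate this chunk with the same do-while pattern as above; the
        // chunk's results go out of scope (freeing memory) before the next box.
    }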

I know this is a very late answer to your question, but I hope this might help someone!

shemake