
I have a very big table that I can access either through a Denodo front-end or through the denodo-virtual client + jaydebeapi driver in Python, issuing queries in VQL and saving the results to pandas as described in the manual.

The table is so big that the front-end cannot export it to CSV. I therefore chose the second option, and even that did not work, so I started splitting the table into fragments by filtering on the value of one of its columns.

I cannot use a simple "FETCH NEXT N ROWS" because the dataset is so big that the connection would have to stay open for more than an hour, and jaydebeapi does not allow keeping it open that long. So I have to close and re-open the connection now and then.

Is there a way I could download N rows, close the connection, re-open it, and download the next N rows? Filtering with a WHERE clause makes the database scan the whole table first, which is likely inefficient.
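
In other words, is something like the sketch below viable? This is only a rough idea of what I'm after, assuming col_A is unique and orderable so it can serve as a paging key, and that VQL accepts the FETCH NEXT ... ROWS ONLY syntax. The point is keyset pagination: resuming after the last key already downloaded instead of making the server skip rows with OFFSET.

import jaydebeapi as dbdriver
import pandas as pd

BATCH = 100_000
last_key = None
batch_no = 0

while True:
    # fresh connection per batch, so no single connection stays open too long
    cnxn = dbdriver.connect("com.denodo.vdp.jdbc.Driver", conn_uri2,
        driver_args={"user": denodoserver_uid, "password": denodoserver_pwd},
        jars=denododriver_path)
    cur = cnxn.cursor()

    # keyset pagination: resume strictly after the last key already downloaded
    if last_key is None:
        query = (f"SELECT col_A, col_B, col_C FROM big_table "
                 f"ORDER BY col_A FETCH NEXT {BATCH} ROWS ONLY")
    else:
        query = (f"SELECT col_A, col_B, col_C FROM big_table "
                 f"WHERE col_A > '{last_key}' "
                 f"ORDER BY col_A FETCH NEXT {BATCH} ROWS ONLY")
    cur.execute(query)
    rows = cur.fetchall()
    headers = [column[0] for column in cur.description]
    cur.close()
    cnxn.close()

    if not rows:
        break
    last_key = rows[-1][0]  # col_A of the last row in this batch
    pd.DataFrame.from_records(rows, columns=headers).to_parquet(f'batch_{batch_no}.parquet')
    batch_no += 1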

So far, my actual Python code to download the fragments of the table looks like this:

import jaydebeapi as dbdriver
import pandas as pd

conn_uri2 = 'jdbc:vdb://mydb-denodo-mydb.com:9999/FOLDER?user=USER_NAME&password=USER_PASSWORD&chunkSize=10000'
unique_column_values = ['A', 'B', 'C']

for col_value in unique_column_values:
    # open a fresh connection for each fragment
    cnxn = dbdriver.connect("com.denodo.vdp.jdbc.Driver", conn_uri2,
        driver_args={"user": denodoserver_uid, "password": denodoserver_pwd},
        jars=denododriver_path)

    ## Define a cursor and execute the query for this fragment
    cur = cnxn.cursor()
    query = f"SELECT col_A, col_B, col_C FROM big_table WHERE col_A = '{col_value}'"
    cur.execute(query)
    results2 = cur.fetchall()
    headers = [column[0] for column in cur.description]

    # terminate the cursor and the connection
    cur.close()
    cnxn.close()

    # Get the data into a dataframe and save it
    pd.DataFrame.from_records(results2, columns=headers).to_parquet(f'result_{col_value}.parquet')
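
The list of filter values above is hardcoded; in case it matters, pulling the distinct values of col_A first would look roughly like this (same connection settings as above, assuming col_A has a manageable number of distinct values):

# fetch the distinct values of col_A up front, so the fragment loop
# does not rely on a hardcoded list
cnxn = dbdriver.connect("com.denodo.vdp.jdbc.Driver", conn_uri2,
    driver_args={"user": denodoserver_uid, "password": denodoserver_pwd},
    jars=denododriver_path)
cur = cnxn.cursor()
cur.execute("SELECT DISTINCT col_A FROM big_table")
unique_column_values = [row[0] for row in cur.fetchall()]
cur.close()
cnxn.close()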