I have a very big table that I can access either through the Denodo front-end or through the Denodo virtual client + jaydebeapi driver in Python, making queries in VQL and saving the results to pandas as described in the manual.
This table is so big that the front-end cannot export it to CSV. I therefore chose the second option, but even then it was not working, so I started fragmenting the table into parts by filtering it on a value of one of the columns.
I cannot use a simple "FETCH NEXT N ROWS" because the dataset is so big that the connection would have to stay open for more than an hour, which jaydebeapi does not allow, so I have to close and re-open the connection now and then.

Is there a way I could download N rows, close the connection, re-open it, and download the next N rows? Using a WHERE clause makes the database go through the whole table first, which is likely inefficient.
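
To make clearer what I am after, here is a rough keyset-pagination sketch of the kind of loop I imagine (not code I have working: the id column is hypothetical, big_table has no such key that I know of, and I am not sure Denodo's VQL accepts exactly this FETCH syntax):

import jaydebeapi as dbdriver
import pandas as pd

BATCH = 100000
last_id = 0   # hypothetical monotonically increasing key column
part = 0

while True:
    # open a fresh connection for every batch so none stays open for >1h
    cnxn = dbdriver.connect("com.denodo.vdp.jdbc.Driver", conn_uri2,
                            driver_args={"user": denodoserver_uid,
                                         "password": denodoserver_pwd},
                            jars=denododriver_path)
    cur = cnxn.cursor()
    # seek past the last key seen instead of OFFSET-scanning from the start
    cur.execute(f"SELECT id, col_A, col_B, col_C FROM big_table "
                f"WHERE id > {last_id} ORDER BY id "
                f"FETCH NEXT {BATCH} ROWS ONLY")
    rows = cur.fetchall()
    headers = [column[0] for column in cur.description]
    cur.close()
    cnxn.close()
    if not rows:
        break
    last_id = rows[-1][0]  # resume after the highest key downloaded so far
    pd.DataFrame.from_records(rows, columns=headers).to_parquet(f'part_{part}.parquet')
    part += 1

The question is whether something like this avoids the full scan that a plain WHERE filter causes, and whether there is a better-supported pattern for it in Denodo.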
So far my Python code to download fragments of the table looks like this:
import jaydebeapi as dbdriver
import pandas as pd

conn_uri2 = 'jdbc:vdb://mydb-denodo-mydb.com:9999/FOLDER?user=USER_NAME&password=USER_PASSWORD&chunkSize=10000'
unique_column_values = ['A', 'B', 'C']

for col_value in unique_column_values:
    cnxn = dbdriver.connect("com.denodo.vdp.jdbc.Driver", conn_uri2,
                            driver_args={"user": denodoserver_uid,
                                         "password": denodoserver_pwd},
                            jars=denododriver_path)
    ## Define a cursor and execute the query for this fragment
    cur = cnxn.cursor()
    query = f"SELECT col_A, col_B, col_C FROM big_table WHERE col_A = '{col_value}'"
    cur.execute(query)
    results2 = cur.fetchall()
    headers = [column[0] for column in cur.description]
    # terminate the connection
    cur.close()
    cnxn.close()
    # Get the data into a dataframe and write one parquet file per fragment
    pd.DataFrame.from_records(results2, columns=headers).to_parquet(f'result_{col_value}.parquet')