
I am still learning Python. I need to handle the case where a SQL query returns no rows, using the pandas read_sql function with the chunksize parameter.

Here is the current line:

df = pd.concat([x for x in pd.read_sql(SQL_request, self.connection, chunksize=50000)], ignore_index=True)

When the query returns zero rows, I get this error:

  File "[....]\lib\site-packages\pandas\core\reshape\concat.py", line 239, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

What is the best way to handle this? I need to get an empty DataFrame even if there are no rows (the columns must still be there). I also need to keep the chunking, since it really helps keep memory usage down.

I thought about running a first query without chunking to check whether there are any rows, and then running a second chunked query, but that feels like a very bad and inefficient idea.
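Written out, that idea would look something like the sketch below (untested, just to show what I mean), and it makes the problem obvious: the first un-chunked read already loads the whole result into memory, which defeats the point of chunking, and the query runs twice.

# Sketch of the two-query idea, not a recommendation
tmp = pd.read_sql(SQL_request, self.connection)  # full, un-chunked read just to check for rows
if tmp.empty:
    df = tmp  # already an empty DataFrame with the columns
else:
    df = pd.concat([x for x in pd.read_sql(SQL_request, self.connection, chunksize=50000)], ignore_index=True)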

Yass T

2 Answers


Since you are concatenating all the chunks into a single DataFrame, it seems you are using chunking only to avoid using too much memory. Maybe you can try our tool ConnectorX (pip install -U connectorx), which aims to improve the performance of pandas.read_sql in terms of both time and memory usage, and provides a similar API. To switch to it, you only need to:

import connectorx as cx

# Currently ConnectorX supports PostgreSQL, MySQL, Oracle, MSSQL and SQLite.
# conn_url example for MySQL: mysql://username:password@server:port/database

df = cx.read_sql(conn_url, SQL_request)

The reason pandas.read_sql uses a lot of memory is its large intermediate Python objects; in ConnectorX we use Rust and stream the results internally to tackle this problem.

Here are some benchmark results on memory usage for PostgreSQL and MySQL (benchmark charts not reproduced here).

Xiaoying Wang

Try this:

df = pd.concat([x for x in pd.read_sql(SQL_request, self.connection, chunksize=50000) if not x.empty], ignore_index=True)

EDIT:

Ah got it. Can you try the following code then? I'll update the answer if it works.

try:
    df = pd.concat([x for x in pd.read_sql(SQL_request, self.connection, chunksize=50000)], ignore_index=True)
except ValueError:
    # No chunks were yielded: re-run without chunksize to get an empty DataFrame with the columns.
    df = pd.read_sql(SQL_request, self.connection)
Carlos Azevedo
  • try the edit in the answer above (I couldn't format the code in the comments). – Carlos Azevedo Oct 23 '18 at 14:03
  • When you use chunksize it returns a generator instead of a Dataframe. And there's no trivial way to test if the generator is empty or not: [link](https://stackoverflow.com/questions/661603/how-do-i-know-if-a-generator-is-empty-from-the-start) . So the solution I provide is to try and if it fails go get the empty dataframe. – Carlos Azevedo Oct 23 '18 at 14:09
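For completeness, an equivalent way to avoid relying on the exception is to materialize the chunks into a list first and branch on whether the list is empty. This is only a sketch, reusing the SQL_request and self.connection names from the question; the fallback read without chunksize is what produces the empty DataFrame with its columns.

import pandas as pd

# read_sql with chunksize yields DataFrames lazily; list() collects them all.
chunks = list(pd.read_sql(SQL_request, self.connection, chunksize=50000))

if chunks:
    df = pd.concat(chunks, ignore_index=True)
else:
    # Zero chunks were yielded: re-run without chunksize to get an empty
    # DataFrame that still carries the column names.
    df = pd.read_sql(SQL_request, self.connection)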