I've recently begun working with database queries: I was asked to develop a program that reads the last month of data from a Firebird database with almost 100M rows.
After stumbling a bit, I finally managed to filter the database using Python (more specifically, the Pandas library), but the code takes more than 8 hours just to filter the data, which makes it useless for running the task as often as required.
The rest of the code runs really quickly, since I only need around the last 3000 rows of the dataset.
So far, the function responsible for executing the query is:
import time
import pandas as pd
import pyodbc

def read_query(access):
    start_time = time.time()
    conn = pyodbc.connect(access)
    df = pd.read_sql_query(r"SELECT * from TABLE where DAY >= DATEADD(MONTH, -1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'", conn)
Or, isolating the query:
SELECT * from TABLE where DAY >= DATEADD(MONTH,-1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'
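For reference, I call the function with a pyodbc connection string along these lines (the driver name and all values below are placeholders, not my real server, user, or password):

# Placeholder connection string for the Firebird ODBC driver; every value here is made up.
access = "DRIVER={Firebird/InterBase(r) driver};DBNAME=myserver:/data/mydb.fdb;UID=SYSDBA;PWD=masterkey"
read_query(access)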
Since I will only need the last X rows of the table (where X changes every day), I know I could optimize my code by reading only that part of the database, starting from the most recent rows, instead of pulling and processing the entire dataframe (something like the untested sketch below).
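To illustrate, this is roughly what I have in mind, although it's untested; I'm not sure whether Firebird's ROWS clause is the right tool here, and read_last_rows and n_rows are just hypothetical names for the sketch:

import pandas as pd
import pyodbc

def read_last_rows(access, n_rows):
    # Hypothetical sketch: fetch only the newest n_rows records instead of a whole month.
    conn = pyodbc.connect(access)
    # ORDER BY DAY DESC returns the most recent rows first; ROWS limits how many come back.
    # n_rows is inlined as an integer because I don't know if ROWS accepts a ? placeholder.
    query = "SELECT * FROM TABLE ORDER BY DAY DESC ROWS {:d}".format(int(n_rows))
    return pd.read_sql_query(query, conn)

The rows would come back newest-first, so I would probably have to reverse the dataframe afterwards, and I also don't know whether this actually avoids scanning the whole table.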
So my question is: how can that be done? And if it's not a good idea/approach, what else could I do to solve this issue?