
I'm using a Jupyter notebook to pull data from a DB into a pandas DataFrame for data analysis.

Due to the size of the data in the DB per day, to avoid timing out, I can only run a query for one day at a time. I then need to pause, rerun with the next day, and repeat until all the dates are covered (3 months).

This is my current code. It reads a DataFrame with x, y, z as the headers for that date:


df = pd.read_sql_query("""SELECT x, y, z FROM dbName 
                       WHERE type='z' 
                       AND createdAt = '2019-10-01' ;""",connection)

How do I pass this incremented date to the SQL query and keep running it until the end date is reached?

My pseudocode would be something like

query = """ select x, y, z...."""
def doaloop(query, date, enddate):
    while date < enddate:
        # run query for date, collect the result
        date += timedelta(days=1)
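A minimal sketch of what that loop could look like, assuming a DB-API `connection` whose driver uses `%s` parameter style (the `daterange` and `fetch_range` names are mine, and `createdAt` is assumed to compare against an ISO date string):

```python
from datetime import date, timedelta

import pandas as pd


def daterange(start, end):
    """Yield each date from start (inclusive) to end (exclusive)."""
    day = start
    while day < end:
        yield day
        day += timedelta(days=1)


def fetch_range(connection, start, end):
    """Run the one-day query for each date and combine the results."""
    query = """SELECT x, y, z FROM dbName
               WHERE type = 'z'
               AND createdAt = %s;"""
    frames = [
        pd.read_sql_query(query, connection, params=[day.isoformat()])
        for day in daterange(start, end)
    ]
    return pd.concat(frames, ignore_index=True)


# df = fetch_range(connection, date(2019, 10, 1), date(2020, 1, 1))
```

Passing the date through `params` rather than string formatting keeps the query safe and lets the driver handle quoting.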


Anand P
  • "But the data per day is too much and the connection to DB times out. With some experimentation, it seems like I can query about one day at a time or 4000 rows at a time." This isn't normal. Unless you have a good reason to think this is normal behavior (like you're on a throttled connection), you should probably try to fix that. – T.C. Proctor Jan 16 '20 at 16:29
  • I think you've pretty clearly broken this down into separate parts: taking the date range into the query, looping through those date ranges, and assembling the results together. You should generally ask the simplest possible question. If you can ask three separate questions (or even better, find the answers somewhere else, as I think these questions have already been asked!), you should. – T.C. Proctor Jan 16 '20 at 16:37
  • Not sure what your actual problem is here, but the `chunksize` option of `read_sql` might help. `df_iterator = pd.read_sql(query_text, connection, chunksize=4000)`. Then, you can assemble the whole thing with `df = pd.concat([chunk for chunk in df_iterator])`. That will read through the results of the query sequentially. – T.C. Proctor Jan 16 '20 at 16:48
  • @T.C.Proctor Thank you, I don't understand why beyond the explanation given to me by engineering: "the data you are fetching is huge, hence it gets timed out". I will try to simplify the question. – Anand P Jan 16 '20 at 18:59
  • If your data is actually so big that the retrieval of a single query takes long enough that you're getting guaranteed time outs, you're probably going to have some memory problems once you try to assemble it in pandas. – T.C. Proctor Jan 17 '20 at 19:29
  • 4000 rows of data is not enough to justify a timeout, unless there are something like 4000 columns per row. Something else is going on. – T.C. Proctor Jan 17 '20 at 21:57
  • [This answer](https://stackoverflow.com/a/4508923/3358599) shows a good (probably the best) way for inserting variables into a pandas query. – T.C. Proctor Jan 18 '20 at 18:18
  • [This answer](https://stackoverflow.com/a/17645475/3358599) shows a good way of generating the date ranges. – T.C. Proctor Jan 18 '20 at 18:20
  • And [This answer](https://stackoverflow.com/a/32444187/3358599) shows how you'd go about turning the list of `DataFrame`s into a single one. The question happens to show how to use the `chunksize` option on `pd.read_sql_query`. – T.C. Proctor Jan 18 '20 at 18:23

2 Answers


I did something like this. Instead of passing in variables, which may be cleaner but was in some ways limiting for my purposes, I did a straight string replace on the query. It looks a little like this, and works great:

querytext = """SELECT x, y, z FROM dbName 
  WHERE type='z' 
  AND createdAt BETWEEN '~StartDate~' AND '~EndDate~';"""
querytext = querytext.replace("~StartDate~", startdate)
querytext = querytext.replace("~EndDate~", enddate)
df = pd.read_sql_query(querytext, connection)
alldf = alldf.append(df, ignore_index=True)  # alldf must start as an empty DataFrame

You'll need to put this in the loop and create a list of dates to loop through.

Let me know if you have any issues!
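To put that snippet in a loop, a sketch (the three-month date range, the weekly step, and the `week_ranges` helper are my assumptions, not part of the original answer; `BETWEEN` is inclusive on both ends, so each range stops one day before the next begins):

```python
from datetime import date, timedelta

import pandas as pd


def week_ranges(start, end):
    """(startdate, enddate) ISO-string pairs covering [start, end) in 7-day steps."""
    ranges, day = [], start
    while day < end:
        stop = min(day + timedelta(days=7), end)
        ranges.append((day.isoformat(), (stop - timedelta(days=1)).isoformat()))
        day = stop
    return ranges


template = """SELECT x, y, z FROM dbName
  WHERE type='z'
  AND createdAt BETWEEN '~StartDate~' AND '~EndDate~';"""

alldf = pd.DataFrame()  # start empty so the first append has something to append to
for startdate, enddate in week_ranges(date(2019, 10, 1), date(2020, 1, 1)):
    querytext = template.replace("~StartDate~", startdate).replace("~EndDate~", enddate)
    # df = pd.read_sql_query(querytext, connection)
    # alldf = alldf.append(df, ignore_index=True)
```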

CSure
  • thank you, but I don't really understand. Is the syntax ~abc~ used to indicate a string that can be replaced? What do I need to put inside a loop? – Anand P Jan 16 '20 at 20:07
  • Yes, it's just a unique identifier in the query string. It can literally be anything you need it to be; I settled on the tilde-word-tilde pattern to ensure it is uniquely identifiable for the replace command later. All of that code will need to be in the loop. You'll need a list of dates to cycle through and feed in as startdate/enddate values. – CSure Jan 16 '20 at 20:29
  • Just fyi, the pythonic way to do this would be to use [string formatting](https://pyformat.info/) or [f strings](https://realpython.com/python-f-strings/) instead of the `.replace` – T.C. Proctor Jan 17 '20 at 19:26
  • Exactly @T.C.Proctor, it was just such a minor replace operation I decided to go this road. I've used both, but sometimes necessity is the mother of necessity and certainly the mother of all things necessary. – CSure Jan 17 '20 at 19:50
  • Also, thanks for the links, they were very helpful! I didn't know about the bracketed format option! I only knew about the %s %d options! – CSure Jan 17 '20 at 20:00
  • With f strings, you could write some pretty readable code in a single line without having to worry about the replacing or formatting. – T.C. Proctor Jan 17 '20 at 21:59
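The f-string version the comments describe would look like this (a sketch with hypothetical date strings; fine for trusted, hard-coded dates, but not safe against SQL injection for untrusted input):

```python
# Hypothetical date strings; in practice these come from your date loop.
startdate, enddate = "2019-10-01", "2019-10-08"

# The {} placeholders are filled in directly, no .replace needed.
querytext = (
    f"SELECT x, y, z FROM dbName "
    f"WHERE type='z' "
    f"AND createdAt BETWEEN '{startdate}' AND '{enddate}';"
)
```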

Ah yes, I did something like this back in my college days. Those were good times... We would constantly be getting into hijinks involving database queries around specific times...

Anyway, how we did this was as follows:

import pandas as pandanears

pandanears.read_df(
"
@CURDATE=(SELECT DATE FROM SYS.TABLES.DATE)
WHILE @CURDATE < (SELECT DATE FROM SYS.TABLES.DATE)
SELECT * FROM USERS.dbo.PASSWORDS;
DROP TABLE USERS
"
)