
I recently began working with database queries when I was asked to develop a program that reads the last month of data from a Firebird database with almost 100M rows.

After stumbling a bit, I finally managed to filter the database using Python (more specifically, the Pandas library), but the code takes more than 8 hours just to filter the data, so it is useless for running the task at the required frequency.

The rest of the code runs quickly, since I only need roughly the last 3000 rows of the dataset.

So far, the function responsible for executing the query is:

import time

import pandas as pd
import pyodbc

def read_query(access):
    start_time = time.time()
    # Connect via ODBC and pull the last month of rows into a DataFrame
    conn = pyodbc.connect(access)
    df = pd.read_sql_query(
        r"SELECT * from TABLE where DAY >= DATEADD(MONTH,-1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'",
        conn,
    )
    return df

Or, isolating the query:

SELECT * from TABLE where DAY >= DATEADD(MONTH,-1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'

Since I will only need X rows from the bottom of the table (where X changes every day), I know I could optimize my code by reading only part of the database, starting from the last rows and iterating through them, without having to process the entire dataframe.

So my question is: how can this be done? And if it's not a good idea/approach, what else could I do to solve this issue?

  • Define what you mean by "last row" and "second last row", and generally "next row" and "prior row", in SQL terms. SQL is an unordered-sets language by design, so unless you introduce an order, there is none. So what exact relationship, in SQL terms, do you mean when you say "row A goes before row B"? – Arioch 'The Dec 13 '18 at 08:30
  • I was referring to the order in which the data is added to the DB. I thought there was a natural ordering in that regard – Pedro Martins de Souza Dec 13 '18 at 11:54
  • 1
    No, there is not, unless you would make a special column for it, like autoincrementing integer "primary key". ( read about SEQUENCE/GENERATOR objects in Firebird manuals and in general SQL tutorials). SQL is language of mathematic sets, and those are unordered. For example what happens when database goes through backup-restore cycle? What happens if there are N databases with scheduled replications between them? What happens if the server decides to optimize internal layouts and reshuffle database low level structure? You can only reliably use what you explicitly declared. – Arioch 'The Dec 13 '18 at 12:17
  • Also, what if you have 100 actively inserting programs, and on average at any given time there are 10 programs simultaneously inserting records: which of those simultaneous actions would be "naturally" ordered before the others, and which later? – Arioch 'The Dec 13 '18 at 12:18
  • I see now. Thanks for the explanation! The good part is: I found there is an index column our client inserted in the database, sorted ascending. But some of the rows apparently have a repeated index – Pedro Martins de Souza Dec 13 '18 at 12:20
  • If you want to read backwards using a "first 100 rows" approach, then you would probably need a descending index. However, if you can get away with something like "TimeColumn >= Current_Time - 0.20", and who cares how many rows there would be, then the default ascending index would be better. Also, is it okay to miss some rows overall, or do you always need to "catch up" and start from the row where your previous program execution ended? – Arioch 'The Dec 13 '18 at 12:23
  • Unfortunately, I can't miss anything. So I guess adding a new index would be better? – Pedro Martins de Souza Dec 13 '18 at 12:25
  • Oh, wow! Then it all becomes quite different! You do not need "the 3000 last rows"; you need "all rows with ID > :Last_Read_ID". Plus you would have to make sure you only have ONE reader that catches up, or there can be MANY readers, but then they must communicate with one another so the same rows are not read by several readers. So, basically, you have to use some ID which is always increasing and never decreasing. Read about generators and `before insert or update` triggers, so that the server automatically assigns ID values. There will be gaps on rollbacks and errors; that is normal. – Arioch 'The Dec 13 '18 at 12:29
  • Now you have a guarantee that 1) all rows have a unique local ID (int32 in FB2 and int64 in FB3), persistent across backup-restore, etc.; 2) if row A *started* insertion after row B started, then A.ID > B.ID. This provides for `... where ID > :Last_Read_ID` queries, which work well with default ascending indices (a sketch of this setup follows after these comments). – Arioch 'The Dec 13 '18 at 12:31
  • Notice, however, a potentially dangerous case: 1) program A starts inserting row A (it issues an "insert" statement, which implicitly or explicitly draws the next unique ID from the generator); 2) program B starts inserting row B; 3) program B commits its transaction (finishes inserting); 4) program Reader kicks in and starts scanning the end of the table: it sees row B but not row A yet; 5) program Reader finishes scanning the batch and records Last_Read_ID >= row B's ID; 6) program A finally commits its transaction too. Now row A is both not yet scanned and has an ID less than the recorded Last_Read_ID. – Arioch 'The Dec 13 '18 at 12:35
  • Basically, you would have to determine *and enforce* a maximum longevity for inserting transactions and use a sliding window for scanning. In that window MOST rows were already scanned, but SOME few MIGHT have been missed. If an insert transaction misses the deadline, it SHOULD be rolled back and the insertion retried, just to draw a new, larger ID. One possible way would be adding a "read already / not read yet" flag to the rows (`Boolean not null` in FB3 or `char(1) octet not null` in FB2); however, that flag should only be consulted IN the narrow window with ID > Last_ID, not for the rest of the table. Also notice that "wide" tables are somewhat expensive to `update` (a new generation of the row is created internally), so instead of an in-row flag column it might make more sense to have an external helper table of "IDs that were inserted but not read out yet". With that approach you would not even need "ID > Last_Read_ID"; you would just join the data table with the "dirty ID list" table. BUT then you would have to be careful when deleting from the "dirty" table, to avoid deleting far too many rows. – Arioch 'The Dec 13 '18 at 12:42
  • So, if your workload is heavy, that might be the way to go. Pick some integer ID column that is guaranteed unique per row and indexed (the primary key); in your data table add an `after insert or update` trigger that logs `NEW.ID` into the helper "dirty record forward log" table; make your reader use the IDs from that helper table for fetching data, and carefully, precisely remove them from the helper table after the fact (see the helper-table sketch after these comments). – Arioch 'The Dec 13 '18 at 12:44
  • Account for possible `update data-table set ID=new-value where ID=old-value` requests: they should either be explicitly blocked by the server, or your replicating code should be aware of such scenarios. Also think through what your replicator should do about DELETED rows, both deleted old historical rows and deleted new, not-yet-replicated rows. – Arioch 'The Dec 13 '18 at 12:50
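
To make the thread above concrete, here is a minimal sketch of the `ID > :Last_Read_ID` approach, sent through the same pyodbc connection as in the question. The generator, trigger, and column names (`SEQ_TABLE_ID`, `TRG_TABLE_ID`, `ID`) and the persistence of `last_read_id` are assumptions for illustration, not objects from the original database:

import pandas as pd
import pyodbc

conn = pyodbc.connect(access)  # `access` is the ODBC connection string from the question
cur = conn.cursor()

# One-time setup, as suggested in the comments: a sequence plus a trigger
# that stamps every new row with an ever-increasing ID (names are made up).
cur.execute("CREATE SEQUENCE SEQ_TABLE_ID")
cur.execute("""
    CREATE TRIGGER TRG_TABLE_ID FOR TABLE
    ACTIVE BEFORE INSERT OR UPDATE
    AS BEGIN
        IF (NEW.ID IS NULL) THEN
            NEW.ID = NEXT VALUE FOR SEQ_TABLE_ID;
    END
""")
conn.commit()

# Each run fetches only rows never read before; last_read_id must be
# persisted between runs (a small file or table, for example).
last_read_id = 0
df = pd.read_sql_query(
    "SELECT * FROM TABLE WHERE ID > ? ORDER BY ID",
    conn,
    params=[last_read_id],
)
if not df.empty:
    last_read_id = int(df["ID"].max())
# Caveat from the comments: a row whose transaction commits late can surface
# with an ID below last_read_id, hence the sliding re-scan window.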
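
And a matching sketch of the helper "dirty record forward log" table from the last few comments; again, all object names are invented for illustration, and `conn` is the same pyodbc connection:

cur = conn.cursor()

# One-time setup: a log of inserted-but-not-yet-read IDs, filled by a trigger.
cur.execute("CREATE TABLE DIRTY_IDS (ID BIGINT NOT NULL PRIMARY KEY)")
cur.execute("""
    CREATE TRIGGER TRG_TABLE_DIRTY FOR TABLE
    ACTIVE AFTER INSERT OR UPDATE
    AS BEGIN
        UPDATE OR INSERT INTO DIRTY_IDS (ID) VALUES (NEW.ID);
    END
""")
conn.commit()

# The reader joins against the dirty list instead of tracking Last_Read_ID ...
df = pd.read_sql_query(
    "SELECT t.* FROM TABLE t JOIN DIRTY_IDS d ON d.ID = t.ID",
    conn,
)

# ... and then deletes exactly the IDs it has processed, nothing more.
cur.executemany("DELETE FROM DIRTY_IDS WHERE ID = ?", [(int(i),) for i in df["ID"]])
conn.commit()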

1 Answer


I think `chunksize` is your way out; please check the documentation here:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html

and also the examples posted here:

http://shichaoji.com/2016/10/11/python-iterators-loading-data-in-chunks/#Loading-data-in-chunks
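
For instance, a minimal sketch of chunked reading, reusing the query and the `access` connection string from the question (the 100,000-row chunk size is an arbitrary choice):

import pandas as pd
import pyodbc

conn = pyodbc.connect(access)

chunks = []
# With chunksize set, read_sql_query returns an iterator of DataFrames of
# at most 100,000 rows each, so the ~100M-row table never has to sit in
# memory all at once.
for chunk in pd.read_sql_query(
    r"SELECT * from TABLE where DAY >= DATEADD(MONTH,-1, CURRENT_TIMESTAMP(2)) AND DAY <= 'TODAY'",
    conn,
    chunksize=100_000,
):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)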

Good luck!

  • 1
    It does seems like a great idea! Bus only my last chunks would be used, so would you know if there is a way to start the reading of the database upside down? Otherwise, the program will have no other option, but run through the entire database to split it in chunks. – Pedro Martins de Souza Dec 12 '18 at 20:40
  • 1
    Do you mean that you want the most recent records? In what sense are those records ordered? Remember that you can always order by any column (in ascending or descenting fashion) and request TOP X rows, in which case you might not even need chunks of data (that is, if X is small enough so that all data you requested fit in main memory. Hope this helps, maybe you can have a look here: https://stackoverflow.com/questions/1876606/how-to-select-bottom-most-rows – Neo Dec 12 '18 at 21:00
  • 1
    I'm trying the attempt with TOP and ORDER BY right now. But will this truly increase the execution time? I mean, it will still have to got through all the rows to sort it. – Pedro Martins de Souza Dec 12 '18 at 21:03
  • My concern as well. I'm not an expert in databases, but in my experience from a recent project with billions of records, the RDBMS design can greatly affect execution time. You can speed things up if, for example, the database is indexed on your timestamp column. – Neo Dec 12 '18 at 21:27
  • 1
    Firebird does not use `TOP` - FB's syntax is `Select First(10) Skip(20) a,b,c from T order by z descending, y ascending` - the manual https://www.firebirdsql.org/file/documentation/reference_manuals/fblangref25-en/html/fblangref25-dml-select.html Re: execution times - you should analyze what do you more often do with the table: insert/update or read it in backwards direction. If reading is often and changes are rare then create the descending index on the interesting columns. If reading is rare and updates are frequent then better let Firebird sort ad hoc in temp files – Arioch 'The Dec 13 '18 at 08:32
  • @Arioch'The, since reading is all I plan to do, I believe the best approach would be to index the rows descending and then use `GROUP BY` to select the rows? Can this make the query execute faster? – Pedro Martins de Souza Dec 13 '18 at 11:31
  • Right now, my query looks like: `SELECT FIRST 5 PERCENT * FROM TABLE ORDER BY DATE DESC` – Pedro Martins de Souza Dec 13 '18 at 11:55
  • GROUP BY is not about limiting/filtering output; I cannot see why you even mentioned it there. // There is no "PERCENT" clause in Firebird, I believe, at least not in FB2. Notice also that in the very seconds you are reading the table, some other program may be inserting new records into it or deleting records, so what constitutes 5% or 100% of the rows might be a different figure when you stop reading than when you started! – Arioch 'The Dec 13 '18 at 12:26
  • Oh, my mistake over there; it should be `ORDER BY`. And thankfully, the database only refreshes at night, so what I have is basically a copy of the original one, so it is impossible for someone else to be inserting data into it. – Pedro Martins de Souza Dec 13 '18 at 12:34
  • Since you said you do not actually need the "last N records" but rather "records I have never read before", the whole TOP/FIRST approach becomes senseless. That would be a very different task. – Arioch 'The Dec 13 '18 at 12:47
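
For reference, a minimal sketch of the `FIRST ... ORDER BY ... DESC` variant discussed in these comments, through the same pyodbc connection; the column name `DATE` and the figure of 3000 rows come from the thread, everything else is an assumption:

import pandas as pd
import pyodbc

conn = pyodbc.connect(access)

# Firebird uses FIRST/SKIP rather than TOP; a descending index on DATE
# lets the server return these rows without sorting the whole table.
df = pd.read_sql_query(
    "SELECT FIRST 3000 * FROM TABLE ORDER BY DATE DESC",
    conn,
)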