
Background:

I have an application written in Python that monitors the status of tools. The tools send data from specific runs, and it all gets stored in an Oracle database as JSON files.

My Problem/Solution:

Instead of connecting to the DB and querying it repeatedly every time I want to compare the current run's data to the previous run's data, I want to keep a copy of the query results so that I can compare the new run's data against that copy instead of against a fresh query. The reason I want to do this is that constantly querying the server for the previous run's data is slow and puts unwanted load on the server.

The previous run's data has multiple files associated with it (because there are multiple tools), so each query returns more than one file that would need to be copied. Storing local copies of the files returned by the query is what I intended to do, but I was wondering what the best way to go about this is, since I am relatively new to doing something like this.

  • So any help and suggestions on how to efficiently store the results of a query, which are multiple JSON files, would be greatly appreciated!
jason
  • Why don't you MD5 the data and just compare the new MD5 with the old hash? – mbieren Jul 11 '17 at 14:09
  • @mbieren When I said compare, I meant that I want to be able to extract the values and compare those. I will be doing data analysis with the previous data and the new data. From your comment I assume you meant hashing the data to check whether the two runs are the same, and that is not what I want to do. Sorry for any confusion in the original wording of my post. If I am wrong with my assumption of what you meant, please correct me. Thanks. – jason Jul 11 '17 at 14:13
  • I think the efficiency highly depends on the amount of data and how the values are stored in your database. Is the data stored in VARCHAR2 (4000 or 32767) or CLOB? – mbieren Jul 11 '17 at 14:24
  • 1
    Which Oracle Database version do you use? Oracle added JSON functionality to 12c. Your queries for the previous data should perform very good with the new feature. – o0x258 Jul 11 '17 at 14:27
  • That's my guess too. Storing them locally in a temp file would not be much faster, quite apart from the space issue. – mbieren Jul 11 '17 at 14:30
  • @mbieren I do not know what it is stored in with regard to what you mentioned; it is a JSON table/file and looks like a dictionary. Also, regarding your second comment about the temp file... do you have any data/proof that it would not be much faster? Edit: another thing, file sizes range from 4K to 50K. There can be just a couple of files, up to counts in the tens (10, 20, 30), but not usually more than 50. – jason Jul 11 '17 at 14:47
  • @ora-600 I don't know, I can find out and get back to you later though. (I do not think it is anything super recent) – jason Jul 11 '17 at 14:48
  • Depending on your amount of data (number of files and file sizes) I would give reading from the DB a try, especially if you have to write the data to the DB anyway and you already have (and must have) the DB connection. – mbieren Jul 11 '17 at 15:06
  • Queries are extremely slow for whatever reason, whether due to the database (not file size), heavy server load, all of the above, or something else. I am not really sure, since I do not have access to it directly. After talking to someone else I know, they suggested that making a copy of the query results to store would be faster than making multiple queries. – jason Jul 11 '17 at 15:13
  • @mbieren Is there a specific way you would do what you mentioned? Also, by "already written to the DB" do you mean the data being put in there in the first place? That is done by an API that communicates with the tools directly. – jason Jul 11 '17 at 15:16
  • I will write an answer, please take a look. – mbieren Jul 11 '17 at 20:58

1 Answer


As you described, querying the DB too many times is not an option. OK, in that case I would do it the following way:

When your program starts, you get the data for all tools as a set of JSON files per tool, right? OK. I am not sure how you get the data, whether by querying the tools directly or by querying the DB; it does not matter.

You check whether you have old data in the "cache dictionary" for that tool. If yes, do your comparison and then store the "new data" as "previous data" in the cache, ready for the next run. Do this for all tools. This loops forever :-)

This "cache dictionary" now can be implemented in memory or on disk. For your amount of data I think memory is just fine.

With that approach you do not have to query the DB for the old data. The case where you cannot do the comparison because there is no old data in the "cache" at program start could be handled by trying to get it from the DB (risking long query times, but what else can you do :-)
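
If you want the cache to survive a restart, so that a cold start does not always force the slow query, a rough sketch of persisting it to disk with a DB fallback could look like this; the file name and `load_previous_from_db()` are illustrative assumptions, not part of the original answer:

```python
import json
import os

CACHE_FILE = "previous_runs_cache.json"  # assumed local cache file name


def load_cache():
    # Read the cached "previous run" data from disk, if it exists.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    return {}  # cold start: no cache yet


def save_cache(previous_runs):
    # Write the cache back to disk after each run.
    with open(CACHE_FILE, "w") as f:
        json.dump(previous_runs, f)


def load_previous_from_db(tool_name):
    # Placeholder for the slow fallback query against Oracle;
    # replace with your real query for that tool's previous run.
    return None


def get_previous_data(tool_name, previous_runs):
    if tool_name in previous_runs:
        return previous_runs[tool_name]
    # Cache miss at program start: fall back to the database.
    return load_previous_from_db(tool_name)
```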

mbieren
  • Thank you very much for the answer! I get all of the data from the database. The tools have their own way of uploading their data to the database that I do not interact with directly, I just grab the data in the database that is from the tools. I am new to database interaction so I am still trying to learn some of the odds and ends. In your opinion, what is one of the easiest (but efficient) ways to query the database and create the "cache dictionary"? Or in the case that I do not have old data, what would be the best way to get it from the database and create the "cache"? – jason Jul 12 '17 at 12:59
  • DB interaction with Python is not a big thing: https://stackoverflow.com/questions/3521692/how-can-i-access-oracle-from-python. And for reading JSON data from Oracle, I am pretty sure you can read the file as a string and then use the standard way to create a dict, e.g. `obj = json.loads(json_string)`, where `json_string` comes from your select. – mbieren Jul 12 '17 at 16:34
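
Putting those two pieces together, a rough sketch using the cx_Oracle driver might look like the following; the connection details, table name TOOL_RUNS, and column DOC are assumptions about the schema, not something stated in the thread:

```python
import json
import cx_Oracle

# Connection details are placeholders; use your real credentials/DSN.
connection = cx_Oracle.connect("user", "password", "dbhost/service_name")
cursor = connection.cursor()

# Assumed schema: one JSON document per tool per run in TOOL_RUNS.DOC.
cursor.execute("SELECT doc FROM tool_runs WHERE run_id = :run_id", run_id=42)

runs = []
for (doc,) in cursor:
    if isinstance(doc, cx_Oracle.LOB):  # CLOB columns come back as LOB objects
        doc = doc.read()
    runs.append(json.loads(doc))  # string -> dict, as in the comment above

cursor.close()
connection.close()
```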