0

I'm doing analysis on data from a MySql database in python. I query the database for about 200,000 rows of data, then analyze in python using Pandas. I will often do many iterations over the same data, changing different variables, parameters, and such. Each time I run the program, I query the remote database (about 10 second query), then discard the query results when the program finishes. I'd like to save the results of the last query in a local file, then check each time I run the program to see if the query is the same, then just use the saved results. I guess I could just write the Pandas dataframe to a csv, but is there a better/easier/faster way to do this?

Troy D
  • 2,093
  • 1
  • 14
  • 28
  • 200K isn't a lot of data. Iterating is *always* slow though, which is why it isn't used for querying, reporting or analyzing data. The database reports produced by well designed applications *don't* load all data in memory before processing. Either create appropriate queries and indexes that will return the results you need without iterations, or create a separate reporting database, whose schema is appropriate for analysis, eg a star schema – Panagiotis Kanavos Apr 04 '17 at 15:57
  • Sounds like a good plan. Write it to a local file then read the file. Can't think of anything easier. – RobertB Apr 04 '17 at 16:01
  • @RobertB actually it's bad and slow. Why iterate in memory when you have a database that can perform set based operations faster than any iteration that doesn't use any kind of optimized access? In 10 seconds, the database should produce the *final* result. It's only 200K rows – Panagiotis Kanavos Apr 04 '17 at 16:02
  • 1
    You are assuming he knows how to rewrite his analysis in SQL. Your advice is solid, but I'm just answering his question. If he has an investment in Python code and doesn't have the time or know-how to rewrite it in SQL, his idea might be a quick and dirty way to get some speedup. – RobertB Apr 04 '17 at 16:41

2 Answers2

0

If for any reason MySQL Query Cache doesn't help, then I'd recommend to save the latest result set either in HDF5 format or in Feather format. Both formats are pretty fast. You may find some demos and tests here:

https://stackoverflow.com/a/37929007/5741205

https://stackoverflow.com/a/42750132/5741205

https://stackoverflow.com/a/42022053/5741205

Community
  • 1
  • 1
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
0

Just use pickle to write the dataframe to a file, and to read it back out ("unpickle").

https://docs.python.org/3/library/pickle.html

This would be the "easy way".

checker
  • 531
  • 2
  • 6
  • 18