
I wrote a Python script that loads a user/artist/playcount dataset and predicts which artists I might like. However, the dataset (a .tsv file I downloaded) is large, so it takes a long time to read it and store the information I need in a dictionary. How can I optimize this? Is there a way to keep the loaded data around so that I don't have to load it again every time I want to make predictions?

Thank you very much.

Alejandro
  • Use an actual database. It doesn't try to hold everything in memory at once. `sqlite` is very straightforward. Another option is to have a daemon process always running, with the db loaded. You communicate with it through IPC (pipes, sockets, etc.). – Paul Rooney Jan 11 '17 at 01:36
  • I found this solution pretty convenient: http://stackoverflow.com/a/6687707/6027071 – Alejandro Jan 11 '17 at 07:11

1 Answer


You could store and load your dictionary using the shelve module. This is likely to help if the time it takes to build the dictionary is large relative to the time it takes to load it into memory, that is, if your processing is complicated or your dictionary is small.
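As a rough sketch of the shelve approach: the snippet below assumes the dictionary maps a user ID to an `{artist: playcount}` dictionary, which is a guess at your layout, and the function names are made up for illustration. Shelf keys must be strings; the values can be any picklable object.

```python
import shelve

def build_playcounts(rows):
    """Aggregate (user, artist, plays) rows into {user: {artist: plays}}."""
    playcounts = {}
    for user, artist, plays in rows:
        playcounts.setdefault(user, {})[artist] = int(plays)
    return playcounts

def save_playcounts(shelf_path, playcounts):
    """Write one shelf entry per user (hypothetical layout)."""
    with shelve.open(shelf_path) as shelf:
        for user, artists in playcounts.items():
            shelf[user] = artists

def load_user(shelf_path, user):
    """Fetch a single user's entry without loading the other users."""
    with shelve.open(shelf_path) as shelf:
        return shelf.get(user)
```

You would run `save_playcounts` once after parsing the TSV file, and call `load_user` in later runs. Only the requested entry is unpickled, so lookups do not pay the cost of rebuilding the whole dictionary.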

If your dictionary is still going to be large, one trick you could use is to store file pointer offsets as the dictionary values. That is, when you want a dictionary value to be some information about a song (for example), instead of storing the information itself in the dictionary, store the byte offset in the TSV file where the corresponding line starts. Then, when you want to access that information, open the TSV file, seek to the offset, read a line, and parse it to construct the object representing that song. Seeks are fast, or at least much faster than reading through the whole file. Alternatively, you could use the mmap module to memory-map the file and effectively treat it as an array of bytes, which is especially useful if you know how many bytes you'll need (or at least have a reasonably low upper bound).
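Here is a minimal sketch of the offset trick, assuming a UTF-8 file with no header row and the user ID in the first tab-separated column (both assumptions about your file; all function names are hypothetical). The file is opened in binary mode so that `tell()` and `seek()` work with plain byte offsets. The last function shows the same lookup through mmap.

```python
import mmap

def build_offset_index(tsv_path):
    """Map each user ID (first column) to the byte offsets of its lines."""
    index = {}
    with open(tsv_path, "rb") as f:
        while True:
            offset = f.tell()        # position where the next line starts
            line = f.readline()
            if not line:             # empty bytes object means end of file
                break
            user = line.split(b"\t", 1)[0].decode("utf-8")
            index.setdefault(user, []).append(offset)
    return index

def _parse(raw_line):
    """Decode one raw line and split it into its tab-separated fields."""
    return raw_line.decode("utf-8").rstrip("\r\n").split("\t")

def read_rows(tsv_path, offsets):
    """Seek to each stored offset and parse only those lines."""
    rows = []
    with open(tsv_path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            rows.append(_parse(f.readline()))
    return rows

def read_rows_mmap(tsv_path, offsets):
    """Same lookup, but slicing a read-only memory map of the file."""
    rows = []
    with open(tsv_path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset in offsets:
            end = mm.find(b"\n", offset)
            if end == -1:            # last line without a trailing newline
                end = len(mm)
            rows.append(_parse(mm[offset:end]))
    return rows
```

The index holds only integers, so it is far smaller than the parsed data, and it could itself be saved to disk so that later runs skip the initial scan.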

If you want to maintain compatibility with other systems written in other programming languages, or if you just want something human-readable, you could store your dictionary as JSON instead, using the json module. I would recommend this only if your dictionary is not too large.
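A short sketch of the JSON route, again assuming a `{user: {artist: playcount}}` dictionary (the helper names are hypothetical). One thing to keep in mind is that JSON object keys are always strings, so integer keys come back as strings after a round trip.

```python
import json

def save_json(path, playcounts):
    """Dump the whole dictionary to a human-readable JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII artist names readable on disk.
        json.dump(playcounts, f, ensure_ascii=False, indent=2)

def load_json(path):
    """Read the whole dictionary back into memory in one go."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Unlike shelve, this reads the entire file on every load, which is why it fits best when the dictionary is small.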

Another solution you could try is to store the information in a database in the first place, instead of in a dictionary. Databases are organized in a way that makes accessing individual records fast. Python's standard library includes the sqlite3 module, which you can use to access an SQLite database, and for a single script like yours that should be sufficient. But if you already have a database server running, or you have special needs that make a separate database server advantageous (like multiple processes accessing the database simultaneously), you can use SQLAlchemy to store and load data in any SQL database.
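For illustration, a minimal sqlite3 sketch might look like this. The table name `plays`, its columns, and the function names are all assumptions made up for the example, not something taken from your dataset.

```python
import sqlite3

def create_db(db_path, rows):
    """Load (user, artist, plays) tuples into a hypothetical `plays` table."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "CREATE TABLE IF NOT EXISTS plays ("
                " user TEXT NOT NULL,"
                " artist TEXT NOT NULL,"
                " plays INTEGER NOT NULL,"
                " PRIMARY KEY (user, artist))"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO plays VALUES (?, ?, ?)", rows
            )
    finally:
        conn.close()

def artists_for_user(db_path, user):
    """Return (artist, plays) pairs for one user, most played first."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT artist, plays FROM plays"
            " WHERE user = ? ORDER BY plays DESC, artist",
            (user,),
        )
        return cur.fetchall()
    finally:
        conn.close()
```

The import step runs once; afterwards each prediction run only queries the rows it needs. The primary key on `(user, artist)` gives SQLite an index to use for lookups by user.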

For completeness I would also mention the pickle module, which can be used to store pretty much any Python object, but I don't think you need to use it directly. There are more streamlined ways to store and load dictionary-type data.
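If you did want to try pickle directly, a minimal sketch would be something like the following (the helper names are hypothetical). Note that unpickling can execute arbitrary code, so only load pickle files you created yourself.

```python
import pickle

def save_pickle(path, obj):
    """Serialize any picklable object (such as the dictionary) to a file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Load the object back; only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)
```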

David Z