I'm working on a problem with a dataset that is larger than memory. The original dataset is a .csv file, and one of its columns contains track IDs from the musicbrainz service.
What I already did
I read the .csv file with dask and converted it to castra format on disk for higher performance. I also queried the musicbrainz API and populated an sqlite DB, using peewee, with some relevant results. I chose to use a DB instead of another dask.dataframe because the process took a few days and I didn't want to lose data in case of any failure.
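Roughly, the two pieces I ended up with look like the sketch below (the file names, the model and its fields are illustrative, not my real schema):

```python
import dask.dataframe as dd
from peewee import SqliteDatabase, Model, CharField

# the big CSV, read lazily with dask (I then dumped it to castra on disk)
tracks = dd.read_csv('tracks.csv')

# the musicbrainz results, persisted to sqlite via peewee so that a crash
# during the multi-day API crawl wouldn't lose anything
db = SqliteDatabase('musicbrainz.db')

class Recording(Model):
    track_id = CharField(primary_key=True)  # musicbrainz track ID
    title = CharField(null=True)
    artist = CharField(null=True)

    class Meta:
        database = db

db.connect()
db.create_tables([Recording])
```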
I haven't really started to analyze the data yet. I managed to make enough of a mess just rearranging it.
The current problem
I'm having a hard time joining the columns from the SQL DB onto the dask / castra dataframe. Actually, I'm not sure whether this is viable at all.
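What I'd like to end up with is roughly the following (assuming the sqlite side fits in memory; the table and column names are placeholders):

```python
import pandas as pd
import dask.dataframe as dd
from sqlalchemy import create_engine

# small side: pull the musicbrainz results into a plain pandas dataframe
engine = create_engine('sqlite:///musicbrainz.db')
mb = pd.read_sql_table('recording', engine)

# big side: the dask dataframe (here re-read from the CSV, but it could be
# the castra-backed frame), joined on the shared track ID column
tracks = dd.read_csv('tracks.csv')
joined = dd.merge(tracks, mb, on='track_id', how='left')
```

If merging a dask dataframe with an in-memory pandas frame like this is viable, that would already solve my problem.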
Alternative approaches
It seems that I made some mistakes in choosing the right tools for the task. Castra is probably not mature enough, and I think that's part of the problem. In addition, it may be better to choose SQLAlchemy over peewee, since SQLAlchemy is used by pandas and peewee is not.
Blaze + HDF5 might serve as a good alternative to dask + castra, mainly because HDF5 is more stable / mature / complete than castra, and blaze is less opinionated about data storage. For example, it might simplify joining the SQL DB into the main dataset.
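A minimal sketch of the HDF5 route I have in mind, assuming the castra step is replaced by a single HDF5 store written once with dask (file and key names are made up, and PyTables would be needed for the HDF5 part):

```python
import dask.dataframe as dd

ddf = dd.read_csv('tracks.csv')
ddf.to_hdf('tracks.h5', '/tracks')   # one HDF5 store instead of castra

# the store can later be read back lazily, e.g. with
# dd.read_hdf('tracks.h5', '/tracks'), or pointed at from blaze
```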
On the other hand, I'm familiar with pandas, and dask exposes the "same" API. With dask I also gain parallelism.
TL;DR
I have a larger-than-memory dataset plus an sqlite DB that I need to join into the main dataset. I'm unsure whether to keep working with dask + castra (I don't know of other relevant data stores for dask.dataframe) and use SQLAlchemy to load parts of the SQL DB at a time into the dataframe with pandas, or to switch to blaze + HDF5 instead. What would you suggest in this case?
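For reference, this is how I picture the "load parts of the SQL DB at a time" option (the chunk size, table and column names are arbitrary):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///musicbrainz.db')

# stream the sqlite table in chunks instead of loading it in one go
chunks = pd.read_sql('SELECT track_id, title, artist FROM recording',
                     engine, chunksize=100000)

for chunk in chunks:
    # merge each chunk onto the main dataframe here, or collect the chunks
    # with pd.concat(...) if the whole table turns out to fit in memory
    pass
```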
Any other option / opinion is welcome. I hope this is specific enough for SO.