7

I'm solving a problem with a dataset that is larger than memory. The original dataset is a .csv file. One of the columns is for track IDs from the musicbrainz service.

What I already did

I read the .csv file with dask and converted it to castra format on disk for higher performance. I also queried the MusicBrainz API and populated an sqlite DB, using peewee, with some relevant results. I chose to use a DB instead of another dask.dataframe because the process took a few days and I didn't want to lose data in case of a failure.
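Roughly, the setup looks like this (file names and column names below are simplified stand-ins for the real ones):

```python
import dask.dataframe as dd
from peewee import SqliteDatabase, Model, CharField

# Larger-than-memory CSV -> dask, then written to castra for faster repeated access
df = dd.read_csv('listens-*.csv')
df.to_castra('listens.castra')

# Results from the MusicBrainz API go into sqlite via peewee, so a crash
# part-way through the multi-day crawl doesn't lose what was already fetched
db = SqliteDatabase('musicbrainz.db')

class Track(Model):
    track_id = CharField(primary_key=True)
    artist = CharField(null=True)
    title = CharField(null=True)

    class Meta:
        database = db

db.connect()
db.create_tables([Track], safe=True)
```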

I haven't really started to analyze the data yet; I've managed to make enough of a mess just rearranging it.

The current problem

I'm having a hard time joining the columns from the SQL DB to the dask / castra dataframe. Actually, I'm not sure whether this is viable at all.
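Is something like the following even the right direction? (Table and column names are placeholders, and this assumes the sqlite side is small enough to fit in memory.)

```python
import sqlite3
import pandas as pd
import dask.dataframe as dd

# Load the (relatively small) MusicBrainz lookup table into pandas...
con = sqlite3.connect('musicbrainz.db')
lookup = pd.read_sql('SELECT track_id, artist, title FROM track', con)

# ...and left-join it onto the big dask dataframe by track ID.
df = dd.read_csv('listens-*.csv')  # or the castra-backed dataframe
joined = dd.merge(df, lookup, on='track_id', how='left')
```

But this feels like it sidesteps castra rather than working with it.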

Alternative approaches

It seems that I made some mistakes in choosing the best tools for the task. Castra is probably not mature enough, and I think that's part of the problem. In addition, it may be better to choose SQLAlchemy over peewee, as it is used by pandas and peewee is not.

Blaze + HDF5 might serve as a good alternative to dask + castra, mainly because HDF5 is more stable / mature / complete than castra and blaze is less opinionated about data storage. For example, it may simplify joining the SQL DB into the main dataset.

On the other hand, I'm familiar with pandas, and dask exposes the "same" API. With dask I also gain parallelism.

TL;DR

I have a larger-than-memory dataset plus an sqlite DB that I need to join into the main dataset. I'm in doubt whether to keep working with dask + castra (I don't know of other relevant data stores for dask.dataframe) and use SQLAlchemy to load parts of the SQL DB into pandas dataframes a chunk at a time. The best alternative I see is switching to blaze + HDF5 instead. What would you suggest in this case?

Any other option / opinion is welcome. I hope that this is specific enough for SO.

Nagasaki45

1 Answer

1

You're correct in the following points:

  • Castra is experimental and immature.

If you want something more mature you could consider HDF5 or CSV (if you're fine with slow performance). Dask.dataframe supports both of these formats in the same way that pandas does.
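For example (file names and the HDF5 key here are illustrative), writing to and reading from HDF5 with dask looks just like it does with pandas:

```python
import dask.dataframe as dd

df = dd.read_csv('listens-*.csv')

# Write to HDF5; the '*' expands to one file per partition
df.to_hdf('listens-*.hdf5', '/data')

# Read it back later, again as a dask dataframe
df2 = dd.read_hdf('listens-*.hdf5', '/data')
```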

  • It is not clear how to join between two different formats like dask.dataframe and SQL.

Probably you want to use one or the other. If you're interested in reading SQL data into dask.dataframe, you could raise an issue. This would not be hard to add for common situations.
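Until something like that exists, one workaround is to read the SQL table in chunks with pandas and stitch the chunks into a dask dataframe with `dask.delayed`. A rough sketch, where the query, chunk size, and total row count are placeholders:

```python
import sqlite3
import pandas as pd
import dask.dataframe as dd
from dask import delayed

def load_chunk(offset, size):
    # Each task opens its own connection and reads one slice of the table
    con = sqlite3.connect('musicbrainz.db')
    query = 'SELECT track_id, artist, title FROM track LIMIT ? OFFSET ?'
    return pd.read_sql(query, con, params=(size, offset))

chunk = 100000
n_rows = 1000000  # e.g. obtained beforehand with SELECT COUNT(*)
parts = [delayed(load_chunk)(off, chunk) for off in range(0, n_rows, chunk)]
tracks = dd.from_delayed(parts)
```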

MRocklin
  • Matthew, joining two dask.dataframes would be great. I will open a ticket for introducing a `from_sql` function. Will it be able to work without describing the schema in SQLAlchemy first? Will it infer the schema automatically? – Nagasaki45 Oct 14 '15 at 16:39
  • We would probably mimic the `pd.read_sql` function. The difference is that you'll probably need to specify a particular column in the table along which to order/partition. – MRocklin Oct 14 '15 at 18:20
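For reference, dask.dataframe now has a `read_sql_table` function along these lines; a minimal usage sketch (table name, URI, and the `id` column are made up), where `index_col` should be an indexed numeric or datetime column used to partition the table:

```python
import dask.dataframe as dd

# Reads the table through SQLAlchemy, partitioned along index_col
tracks = dd.read_sql_table('track', 'sqlite:///musicbrainz.db',
                           index_col='id', npartitions=4)
```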