
I'm about to start working with data that is ~500 GB in size. I'd like to be able to access small components of the data at any given time with Python. I'm considering using PyTables or MongoDB with PyMongo (or Hadoop - thanks Drahkar). Are there other file structures/DBs that I should consider?

Some of the operations I'll be doing include computing distances from one point to another and extracting subsets of the data based on boolean tests, and the like. The results may eventually go online for a website, but at the moment the data is intended to be used only on a desktop for analysis.
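To make the access pattern concrete, here's a rough sketch (with made-up sizes and file names) of the kind of subset access I mean, using NumPy's memmap so only the touched pages of a large on-disk array are read; a real solution would need proper indexing on top of this:

```python
import os
import tempfile

import numpy as np

# Illustrative sizes only; the real dataset is ~500 GB.
n_points, n_dims = 100_000, 3
path = os.path.join(tempfile.gettempdir(), "points.dat")

# Write a stand-in on-disk array once.
points = np.memmap(path, dtype=np.float64, mode="w+", shape=(n_points, n_dims))
points[:] = np.random.default_rng(0).random((n_points, n_dims))
points.flush()
del points

# Later sessions map the file read-only; only the pages we touch
# are actually read from disk.
data = np.memmap(path, dtype=np.float64, mode="r", shape=(n_points, n_dims))

# Distance from each point in a small slice to a reference point.
ref = np.array([0.5, 0.5, 0.5])
chunk = data[:10_000]
dists = np.sqrt(((chunk - ref) ** 2).sum(axis=1))

# Extract rows passing a boolean test.
close = chunk[dists < 0.1]
```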

Cheers

ebressert
  • There should be a requirement to leave a comment if you downvote. Why was this downvoted twice? I'm the first one to downvote a question if it sucks, but this question doesn't seem unreasonable... – Pete Oct 08 '12 at 12:06
  • You may also wish to consider [HDF5](http://stackoverflow.com/a/7891137/190597). – unutbu Oct 08 '12 at 12:19
  • unutbu - That's a good idea. PyTables is based on that. I'm a co-developer for an astronomy data read/write package called ATpy (http://atpy.github.com/) and we make use of HDF5, but accessing subsets of the data requires some big re-writing in the code. It may be the best solution in the end, but I'm waiting to hear what others may suggest before making the commitment. – ebressert Oct 08 '12 at 12:26
  • I'm surprised that this question has been closed. After doing some R&D over the last few days, I have a summary report that I'd like to provide here. Is that only possible once the question has been reopened? – ebressert Oct 18 '12 at 12:15

1 Answer


If you are seriously looking at Big Data processing, I would highly suggest looking into Hadoop; one provider is Cloudera ( http://www.cloudera.com/ ). It is a very powerful platform with many data-processing tools built in. Many languages, including Python, have modules for accessing the data, and a Hadoop cluster can do a significant amount of the processing for you once you have built the various MapReduce, Hive, and HBase jobs for it.
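To give a feel for the Python side, a Hadoop Streaming job is just a pair of scripts that read stdin and write tab-separated key/value pairs; a minimal sketch (the comma-separated "region" field and the sample data here are made up for illustration):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit (key, value) pairs; here, count records per region label."""
    for line in lines:
        region, *_ = line.strip().split(",")
        yield region, 1

def reducer(pairs):
    """Sum values per key; Hadoop delivers mapper output sorted by key."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

if __name__ == "__main__":
    # In a real job, mapper and reducer run as separate scripts, e.g.:
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
    # Here we simulate the shuffle with an in-process sort.
    sample = ["a,1.0,2.0", "b,0.5,0.1", "a,3.0,4.0"]
    pairs = sorted(mapper(sample))
    results = dict(reducer(pairs))
    print(results)  # {'a': 2, 'b': 1}
```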

Drahkar
  • Thanks for the suggestion. I have looked at Hadoop as well. Let me edit my question to include it. I'm curious what the consensus will be. Is the Python support for Hadoop comparable to or better than MongoDB's? – ebressert Oct 08 '12 at 12:12
  • Someone suggested Riak for Python: https://github.com/basho/riak-python-client. Getting closer to closure on this. If I find something, I'll post it here in case anyone has similar questions. – ebressert Oct 09 '12 at 12:25
  • The purposes of Hadoop versus MongoDB, CouchDB, Couchbase, etc. differ significantly. MongoDB, CouchDB, and Couchbase are all NoSQL solutions, whereas Hadoop is a storage and analysis cluster. So what you need depends heavily on what you need to use it for specifically. – Drahkar Oct 10 '12 at 02:44