
I need to manipulate a large amount of numerical/textual data, say 10 billion entries in total, which could in theory be organized as 1,000 tables of 10,000×1,000 entries each. Most calculations need to be performed on a small subset of the data each time (specific rows or columns), so I don't need all the data at once.

Therefore, I am interested in storing the data in some kind of database so I can easily search it, retrieve multiple rows/columns matching defined criteria, make some calculations and update the database. The database should be accessible from both Python and Matlab; I use Python mainly for creating the raw data and putting it into the database, and Matlab for the data processing.

The whole project runs on Windows 7. What is the best and mainly the simplest database I can use for this purpose? I have no prior experience with databases at all.

Sasha

3 Answers


I would recommend SQLite. The default Python installation already has bindings for it.

To use it from outside Python (e.g. from the command-line shell), install the appropriate SQLite Windows binaries.

To create the database you can do something like (from the sqlite3 documentation):

import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

# Create table
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')

# Insert a row of data
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

# Save (commit) the changes
conn.commit()

# We can also close the cursor if we are done with it
c.close() 
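
Retrieving and updating rows matching defined criteria (as the question asks) is then just a `SELECT`/`UPDATE`. A minimal sketch, sticking with the `stocks` table from the snippet above (the threshold value is arbitrary):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # in-memory database, just for illustration
c = conn.cursor()
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

# Retrieve rows matching a criterion; the ? placeholder keeps values out of the SQL string
c.execute('SELECT symbol, qty, price FROM stocks WHERE price > ?', (30,))
print(c.fetchall())  # [('RHAT', 100.0, 35.14)]

# Update matching rows in place, then persist the change
c.execute('UPDATE stocks SET qty = qty + ? WHERE symbol = ?', (50, 'RHAT'))
conn.commit()
conn.close()
```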

And to import into Matlab you can use mksqlite.

For more information you might want to check out: http://labrosa.ee.columbia.edu/millionsong/pages/sqlite-interfaces-python-and-matlab


IMO simply use the file system with a file format that you can read/write in both MATLAB and Python. Databases usually imply a relational model (excluding the No-SQL ones), which would only add complexity here.

Being more MATLAB-inclined, you can directly manipulate MAT-files in SciPy with scipy.io.loadmat/scipy.io.savemat functions. This is the native MATLAB format for storing data, with save/load functions.
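On the Python side this round-trip looks something like the following (the file name `data.mat` and variable name `prices` are just examples); a matrix saved this way can be read in MATLAB with a plain `load('data.mat')`:

```python
import numpy as np
from scipy.io import savemat, loadmat

# Write a matrix from Python into a MAT-file that MATLAB can load directly
data = np.arange(12).reshape(3, 4).astype(float)
savemat('data.mat', {'prices': data})

# Read it back; loadmat returns a dict of variables
# (it also contains metadata keys such as '__header__')
loaded = loadmat('data.mat')
print(loaded['prices'].shape)  # (3, 4)
```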

Unless of course you really need databases, then ignore my answer :)

Amro
  • Is there any fault protection using this method? It would seem if data is being written and there was some sort of crash that the data file would be corrupted. –  Jun 15 '12 at 21:44
  • @Michael: none whatsoever (apart from what the OS provides). But I got the feeling that the OP is creating the data *once*, then doing the analysis in MATLAB, so it's not really an issue. If such a feature is desired, databases are the way to go.. – Amro Jun 15 '12 at 21:52
  • I have already used this approach, but it does not provide a suitable solution since it does not allow simple access to data spanned across multiple files, e.g. comparing the first row of all tables. In this example you would have to open 1000 files with 10000 rows each, just to retrieve one row from each... – Sasha Jun 15 '12 at 21:59
  • @amro With the amount of data being talked about here I know I wouldn't risk using a straight file. Re-importing all the data would waste a lot of time if a crash did happen during processing. –  Jun 15 '12 at 22:00
  • @Sasha If you need to open 1,000 files to read 1 row from each you may have a problem with your [database schema](http://en.wikipedia.org/wiki/Database_schema). –  Jun 15 '12 at 22:00
  • @Sasha: I don't want to push my approach too much, but recent versions of MATLAB have the ability to efficiently save/load parts of variables stored in MAT-files: See [MATFILE](http://www.mathworks.com/help/techdoc/ref/matfile.html) – Amro Jun 15 '12 at 22:05
  • @Sasha: For what it's worth, also check out this [previous answer](http://stackoverflow.com/a/6600301/97160) of mine, where I give an overview of the different ways to access databases from MATLAB.. Just in case you decide to go that route :) – Amro Jun 15 '12 at 22:12
  • Consider the following example. The database contains trading data. Assume you have 1000 stocks, 10000 trading days for each stock and 1000 parameters for each trading day for each stock. Assume you create 1000 mat files (file=stock) each containing 10000*1000 matrix. Now, consider the following tasks: 1. Analyze specific stock (open 1 file) 2. Find 5 stocks with the highest trade volume at specific day (open all files to retrieve only 1 parameter). I was hoping that the database approach would be more efficient in case when both types of queries should be supported. Am I right? – Sasha Jun 15 '12 at 22:51
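For what it's worth, both query patterns in Sasha's stock example map naturally onto a single relational table keyed by stock and day, with an index on the day column so the "best stocks on day X" query doesn't scan everything. A rough sketch in SQLite (table and column names are made up, and the data is a toy 3-stock/2-day set):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE trades (symbol TEXT, day INTEGER, volume REAL)')
c.execute('CREATE INDEX idx_day ON trades (day)')  # speeds up per-day lookups

# Toy data: 3 stocks x 2 days
rows = [('AAA', 1, 100.0), ('AAA', 2, 150.0),
        ('BBB', 1, 300.0), ('BBB', 2, 120.0),
        ('CCC', 1, 200.0), ('CCC', 2, 500.0)]
c.executemany('INSERT INTO trades VALUES (?, ?, ?)', rows)

# Task 1: analyze one specific stock
c.execute('SELECT day, volume FROM trades WHERE symbol = ?', ('AAA',))

# Task 2: highest-volume stocks on a specific day -- one indexed query,
# instead of opening 1000 separate MAT-files
c.execute('''SELECT symbol, volume FROM trades
             WHERE day = ? ORDER BY volume DESC LIMIT 2''', (1,))
print(c.fetchall())  # [('BBB', 300.0), ('CCC', 200.0)]
conn.close()
```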

SQLite is easy to set up, but I had no problem with MySQL either. There are connectors available, and they work pretty seamlessly.

http://www.mathworks.com/matlabcentral/fileexchange/8663-mysql-database-connector

I am running a similar project, where I use Matlab to fetch and analyze a lot of stock market data, and Ruby on Rails to publish it. I work with very large datasets, and this solution seems to work well. Historically, SQLite3 has not performed as well as MySQL or PostgreSQL on large datasets, which is why I recommend switching.

zachd1_618