
*Backstory:* I have recently switched from using Excel to produce models that predict the chance of being diagnosed with a particular cancer. The model was built in an Excel file and grew in both size and complexity; I used Excel's Solver to iterate through simulations, and the file reached 500 MB+, so I was essentially starting to cross over into the realm of 'big data'.

My question to the Stack Overflow community is: what is the best methodology for continuing this research? My hunch is that storing the data in a database and calling each parameter for individual analysis is a possibility. My old Excel methodology fitted a non-linear regression to each parameter (from historic data), enabling the calculation of a percentage chance of acquiring said cancer specific to that individual parameter; the algorithm then weighted each parameter to achieve a final score, on which I performed a logistic regression to calculate the chance of a person developing said cancer.
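Below is a rough sketch of how I picture that pipeline looking in Python (pandas, SciPy, scikit-learn). The column names, the non-linear form, and the weights are placeholders to show the structure rather than my actual model:

```python
# Hypothetical sketch of the Excel workflow translated to Python.
# Column names ('age', 'bmi', 'marker_x', 'diagnosed') are placeholders, not real data.
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("historic_data.csv")          # raw historic records
predictors = ["age", "bmi", "marker_x"]        # the individual parameters

def saturating(x, a, b):
    """Example non-linear form fitted per parameter (stand-in for the Excel Solver fits)."""
    return a * (1 - np.exp(-b * x))

per_param_scores = {}
for col in predictors:
    # Fit the non-linear curve of diagnosis rate against this parameter alone
    popt, _ = curve_fit(saturating, df[col], df["diagnosed"], p0=[1.0, 0.1], maxfev=10000)
    per_param_scores[col] = saturating(df[col], *popt)   # per-parameter chance estimate

scores = pd.DataFrame(per_param_scores)

# Weight the per-parameter scores into a single combined score (placeholder weights)
weights = {"age": 0.5, "bmi": 0.3, "marker_x": 0.2}
combined = sum(scores[c] * w for c, w in weights.items())

# Final logistic regression on the combined score
model = LogisticRegression()
model.fit(combined.to_frame("score"), df["diagnosed"])
risk = model.predict_proba(combined.to_frame("score"))[:, 1]
print(risk[:5])
```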

Any suggestions, comments, pointers and constructive criticism would be greatly appreciated. I have recently made the switch from Excel to Python to continue this work. Kind regards, AEA

  • First off, "big data" is really only relative to the tools available. Although 500 MB might be "big data" for Excel, in general people don't say you're approaching "big data" until you're at least reaching the limits of your computer's memory capacity, which on conventional hardware is already about 4-8 GB. That said, there's a data analytics toolkit for Python you should look into: http://pandas.pydata.org/. Be sure to check the sidebar. – David Marx Jun 11 '13 at 01:54
  • @DavidMarx Yep, the answer is pandas. Many people have 64/94 GB RAM setups; IMO that could be called "big". :) – Andy Hayden Jun 11 '13 at 02:02
  • 500 MB was the point at which I had to stop; I actually have about 12 times this amount (of course the size of the raw data won't be the same as the size of the Excel file). But I can tell it has outgrown Excel modelling. I was recommended Python by someone studying computer science. I will check out pandas.pydata.org, many thanks. – AEA Jun 11 '13 at 02:02
  • Apologies for my embarrassing misinterpretation that I was nearing big data (no sarcasm, I really am embarrassed). – AEA Jun 11 '13 at 02:06
  • When using pandas, do you pull data from a database, or do you analyse the data on the fly? – AEA Jun 11 '13 at 02:35
  • Where to store it depends on your requirements: grab it out and do your analysis. [mysql works](http://stackoverflow.com/questions/16476413/how-to-insert-pandas-dataframe-via-mysqldb-into-database/16477603#16477603), and you can also import from text files. If the data is large you can use [HDF5](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas/14268804#14268804) and query that, but most of the time you just read everything into pandas (in memory) and analyse it (see the sketch below). – Andy Hayden Jun 11 '13 at 11:45
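To make the storage options discussed in the comments concrete, here is a minimal sketch of the patterns mentioned above. The file names, keys and column names are placeholders, and the HDF5 part assumes PyTables is installed:

```python
# Minimal sketch of the storage/analysis patterns mentioned in the comments.
# File names, keys, and column names are placeholders.
import pandas as pd

# Option 1: everything fits in memory -- just read it in and analyse.
df = pd.read_csv("historic_data.csv")
print(df.describe())

# Option 2: data is larger -- store it once in HDF5 (table format so it is queryable) ...
df.to_hdf("cancer_data.h5", key="patients", format="table", data_columns=True)

# ... then pull back only the rows/columns needed for a given analysis.
subset = pd.read_hdf("cancer_data.h5", "patients", where="age > 50", columns=["age", "bmi"])
print(subset.head())
```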

0 Answers