6

Maybe I'll start with a small introduction to my problem. I'm writing a Python program which will be used for post-processing of different physical simulations. Every simulation can create up to 100 GB of output. I deal with different kinds of information (like positions, fields, densities, ...) for different time steps. I would like to have access to all this data at once, which isn't possible because I don't have enough memory on my system. Normally I read a file, do some operations and clear the memory; then I read other data, do some operations and clear the memory again.

Now my problem: if I do it this way, I end up reading some of the data more than once, which takes a lot of time. I would like to read it only once and store it for easy access. Do you know a method to store a lot of data which is really fast to read, or which doesn't need a lot of space?

I just created a method which is around ten times faster than a normal open-read, but I use cat (the Linux command) for that. It's a really dirty method and I would like to kick it out of my script.

Is it possible to use a database to store this data and to get the data faster than with normal reading? (Sorry for this question, but I'm not a computer scientist and I don't know much about databases.)

EDIT:

My cat code looks something like this (only an example):

import os
import string
from numpy import array, reshape

out = string.split(os.popen("cat " + base + "phs/phs01_00023_" + time).read())
# and if I want to have this data as an array then I convert it and reshape it
# (if I need it)
out = array(out, dtype=float)
out = reshape(out, shape)  # 'shape' stands in for the target dimensions

Normally I would use the numpy method numpy.loadtxt, which needs about the same time as normal reading:

f = open('filename')
f.read()
...

I think that loadtxt just uses the normal methods with some additional code lines.
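A portable version of the cat pipeline without the shell could look roughly like this (just a sketch, not tested; it assumes the files only contain whitespace-separated numbers, and reuses the file name from the example above):

import numpy

# read the whole file in pure Python instead of piping it through cat, then let
# numpy parse the whitespace-separated numbers in one go
with open(base + "phs/phs01_00023_" + time) as f:
    out = numpy.fromstring(f.read(), sep=' ')
# reshape afterwards if needed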

I know there are better ways to read out data, but everything I found was really slow. I will now try mmap and hopefully get better performance.

ahelm
  • Can you edit your question to include more specific requirements? What kind of operations? How much of the data set do you need to operate on at once? Are the operations streaming or do you need a full-subset in memory? What are your cat-methods and open-read methods? (sometimes small details can really slow things down) – payne Mar 10 '11 at 18:46
  • "It's a really dirty method"? What does that mean? It works, correct? What makes it "dirty"? Perhaps you should include the code for this. – S.Lott Mar 10 '11 at 18:46
  • At 100GB you're probably going to want some kind of other datastore/database. Be it anything from tokyocabinet to mongodb to redis to sqlite to postgresql. It's all a question of tradeoffs, and you're probably the only one who can determine what you really require. – chmullig Mar 10 '11 at 18:50
  • The dirtiness is the use of Linux commands. I don't think it would work on Windows, but I never tested it, because I don't have any Windows machine next to me. – ahelm Mar 14 '11 at 14:59

3 Answers

7

I would try using HDF5. There are two commonly used Python interfaces, h5py and PyTables. While the latter seems to be more widespread, I prefer the former.
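A rough sketch of what writing and reading with h5py could look like (the file and dataset names here are made up, and positions/density stand for NumPy arrays you already have in memory):

import h5py

# write: store each quantity of each time step as its own dataset
with h5py.File('simulation.h5', 'w') as f:
    f.create_dataset('step_00023/positions', data=positions, compression='gzip')
    f.create_dataset('step_00023/density', data=density, compression='gzip')

# read: only the part you actually slice is loaded from disk
with h5py.File('simulation.h5', 'r') as f:
    first_rows = f['step_00023/positions'][:100]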

Sven Marnach
  • Thanks for your solution. In this case I need to have my data in HDF5 format, right? My simulation results are in a standard FORTRAN format. If I understand it correctly, I could use the FORTRAN API, but I don't want to change the FORTRAN code; I didn't write the simulation code and it isn't mine. Maybe I can convert the normal data (after reading) to HDF5. – ahelm Mar 14 '11 at 11:14
  • It should be rather easy to write a converter reading the data and writing it in HDF5 format -- that should be possible in a few lines of Python code, if the structure of your data isn't too complex. But maybe you could first try Greg's suggestion of using mmap -- if this works for you, it's easier. – Sven Marnach Mar 14 '11 at 11:30
7

If you're on a 64-bit operating system, you can use the mmap module to map that entire file into memory space. Then, reading random bits of the data can be done a lot more quickly since the OS is then responsible for managing your access patterns. Note that you don't actually need 100 GB of RAM for this to work, since the OS will manage it all in virtual memory.

I've done this with a 30 GB file (the Wikipedia XML article dump) on 64-bit FreeBSD 8 with very good results.
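A minimal sketch of how that looks in Python (the file name is just a placeholder):

import mmap

with open('phs01_00023_0001', 'rb') as f:
    # length 0 maps the whole file; the OS pages it in as you touch it
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        header = mm[:1024]          # slicing only reads the pages you touch
        mm.seek(0)
        first_line = mm.readline()
    finally:
        mm.close()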

Greg Hewgill
0

If you're working with large datasets, Python may not be your best bet. If you want to use a database like MySQL or Postgres, you should give SQLAlchemy a try. It makes it quite easy to work with potentially large datasets using small Python objects. For example, if you use a definition like this:

from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, LargeBinary, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import deferred

SqlaBaseClass = declarative_base()

class MyDataObject(SqlaBaseClass):
    __tablename__ = 'datarows'

    eltid   = Column(Integer, primary_key=True)
    name    = Column(String(50, convert_unicode=True), nullable=False, unique=True, index=True)
    created = Column(DateTime)
    updated = Column(DateTime, default=datetime.today)

    # deferred: the large column is only loaded when it is actually accessed
    mylargecontent = deferred(Column(LargeBinary))

    def __init__(self, name):
        self.name    = name
        self.created = datetime.today()

    def __repr__(self):
        return "<MyDataObject name='%s'>" % (self.name,)

Then you can easily access all rows using small data objects:

# set up database connection; open dbsession; ... 

for elt in dbsession.query(MyDataObject).all():
    print elt.eltid # does not access mylargecontent

    if (something(elt)):
        process(elt.mylargecontent) # now large binary is pulled from db server
                                    # on demand

I guess the point is: you can add as many fields to your data as you want, adding indexes as needed to speed up your search. And, most importantly, when you work with a MyDataObject, you can make potentially large fields deferred so that they are loaded only when you need them.
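The connection setup (the part marked "set up database connection" in the snippet above) could look roughly like this; just a sketch using an SQLite file, but any database URL supported by SQLAlchemy would work:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///simdata.db')  # or a postgresql:// or mysql:// URL
SqlaBaseClass.metadata.create_all(engine)       # creates the 'datarows' table if needed
Session = sessionmaker(bind=engine)
dbsession = Session()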

phooji
  • +1 for the Python comment, I don't think it's a good choice for high-load software. – Terseus Mar 10 '11 at 18:51
  • -1 for the python comment, scientific python users work with large datasets all the time – olokki May 16 '13 at 11:41
  • @olokki "Python may not be your best bet" – I don't think that's a very controversial statement, and the rest of my answer does address the original question of how to do this in Python. – phooji May 16 '13 at 18:42