2

I have RAM concerns, and I want to downsize the data I loaded (with read_stata() you cannot read only a few rows, sadly). Can I change the code below to use only some rows for X and y, without making a copy? Even a temporary copy would defeat the purpose: I want to save memory, not add even more to my footprint. Or should I perhaps downsize the data first (does `reshape` do that without a copy if you specify a smaller size than the original?) and then pick some columns?

import pandas as pd

data = pd.read_stata('S:/data/controls/notreat.dta')
X = data.iloc[:, 1:]  # every column except the first
y = data.iloc[:, 0]   # the first column
László
  • Are your RAM concerns justified by profiling data? Are they causing actual issues, or are you committing the root of all evil, premature optimization? – Joran Beasley Jun 23 '14 at 18:12
  • The root of all evil, thanks for teaching me something useful. Seriously, I do have a duplicate copy of the file that is only 20 GB, a bit faster to load, and I can afford a copy of that to downsize (the system has 128 GB). But the full data is 100 GB, and I could not downsize that with a copy, thanks. – László Jun 23 '14 at 18:23
  • To be honest, I would read in the data using the `read_stata` call, then write it out to CSV (`to_csv`), possibly keeping only certain columns (or write it to HDF). Then you can read it back in chunks easily (the `chunksize` parameter) and do what you need; see the sketch below these comments. – Jeff Jun 23 '14 at 19:13
  • @László All I was really asking was "are you sure this is your bottleneck?", since you had not qualified the question with any metrics that demonstrated an actual issue. Now that you have qualified it as 100 GB... I'm kind of surprised you are even able to load the entirety of the original data without running out of memory. – Joran Beasley Jun 23 '14 at 19:26
  • OK. I tried to make a CSV in Stata, but it was so slow that I worried the CSV file would blow up my disk. Can I do this with the data "online", on the fly, in Python? Meaning the data is so big I cannot hold a copy in memory. Will Python do this line by line or something? – László Jun 23 '14 at 19:26
  • @JoranBeasley You are right, actually I haven't loaded that version yet, only tested on smaller files. So maybe it's hopeless. – László Jun 23 '14 at 19:27
  • It can certainly read chunks from pretty much any IO source ... it might be hard to interpret it in chunk sizes (I'm not familiar with the Stata format). – Joran Beasley Jun 23 '14 at 19:28
  • @JoranBeasley You are right, it is surprising pandas does not support row-selection for `read_stata`, but I don't know the details. – László Jun 23 '14 at 19:29
  • pandas provides you with their source ... you could always add an `iterative_read_stata` method or something ... – Joran Beasley Jun 23 '14 at 20:34
  • By the way, just to note: my machine learning application is amazingly fast on the downsized data, so until I downsize, by far the most time is spent loading the dta file. Really slow, especially since it is unnecessary, as I downsize immediately. – László Jun 23 '14 at 21:38
  • And to add one more note for posterity: loading a 10 GB dta file took up more than 70 GB while loading, and blew me out of memory. I would have thought pandas wouldn't be much less efficient with dta than with CSV, but I learned my lesson. – László Jun 24 '14 at 08:02
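
A rough sketch of the convert-once, chunk-read-later workflow suggested in the comments above; the paths, chunk size, and column split are illustrative assumptions, not values from the question:

import pandas as pd

# One-off conversion: read the Stata file (this still needs enough RAM once)
# and write it out as CSV, optionally keeping only the columns you need.
data = pd.read_stata('S:/data/controls/notreat.dta')
data.to_csv('S:/data/controls/notreat.csv', index=False)

# Afterwards, stream the CSV back in manageable chunks instead of loading it whole.
for chunk in pd.read_csv('S:/data/controls/notreat.csv', chunksize=100000):
    X = chunk.iloc[:, 1:]   # every column except the first
    y = chunk.iloc[:, 0]    # the first column
    # ... update the model with this chunk ...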

1 Answer

0

I feel your pain. Pandas is not a memory-friendly library, and 500 MB on disk can quickly turn into more than 16 GB in memory, shredding performance.

However, one thing that has worked for me is memmap. You can use memmap to page numpy arrays and matrices in just about as fast as your data bus permits. And as an added benefit, unused pages may be unloaded.

See here for details. With some work, these memmapped numpy arrays can be used to back a pd.Series or a pd.DataFrame without copying. However, you may find that Pandas later copies your data as you proceed. So, my advice: create a memmap file and stay in numpy-land.
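
A minimal sketch of the memmap approach, assuming the data has already been dumped to a raw binary file of float64 values; the file name, dtype, and shape here are made-up for illustration:

import numpy as np

# Map an existing raw binary file as a read-only array; pages are pulled in
# from disk on demand rather than loading the whole file into RAM.
data = np.memmap('S:/data/controls/notreat.dat', dtype='float64',
                 mode='r', shape=(1000000, 20))

# Basic slicing of a memmap returns another memmap view, not an in-memory copy.
X = data[:, 1:]
y = data[:, 0]

The raw file itself can be written once with a `mode='w+'` memmap of the same shape, filled chunk by chunk from whatever source fits in memory at a time.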

Your other alternative is to use HDFS.
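
If this refers to the HDF5 format (which pandas exposes through `HDFStore` / `to_hdf`) rather than Hadoop's HDFS, a sketch of that route, again with made-up paths, key, and column names, could look like this:

import pandas as pd

# One-off conversion to an HDF5 store; format='table' allows selecting subsets later.
data = pd.read_stata('S:/data/controls/notreat.dta')
data.to_hdf('S:/data/controls/notreat.h5', key='controls', format='table')

# Later, read back only the columns (and, via where=..., only the rows) you need.
subset = pd.read_hdf('S:/data/controls/notreat.h5', key='controls',
                     columns=['y', 'x1', 'x2'])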

user48956