I'm trying to sample 1e7 items from 1e5 strings but I get a MemoryError. Sampling 1e6 items from 1e4 strings works fine. I'm on a 64-bit machine with 4 GB of RAM and don't think I should be hitting any memory limit at 1e7. Any ideas?

$ python3
Python 3.3.3 (default, Nov 27 2013, 17:12:35) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> K = 100

Works fine with N=1e6:

>>> N = int(1e6)
>>> np.random.choice(["id%010d"%x for x in range(N//K)], N)
array(['id0000005473', 'id0000005694', 'id0000004115', ..., 'id0000006958',
       'id0000009972', 'id0000003009'], 
      dtype='<U12')

Error with N=1e7:

>>> N = int(1e7)
>>> np.random.choice(["id%010d"%x for x in range(N//K)], N)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 1092, in mtrand.RandomState.choice (numpy/random/mtrand/mtrand.c:8229)
MemoryError
>>> 

I found this question, "Python not catching MemoryError", but it seems to be about catching an error like this rather than solving it.

I'd be happy with either a solution still using random.choice or a different method to do this. Thanks.

Matt Dowle
  • You are allowing resampling, so I can see a work round just using a random number in the interval required. – doctorlove Sep 02 '14 at 15:56
  • @doctorlove Thx. I've tried `['a','b','c'][np.random.choice(2,1)]` and that works for one. But `['a','b','c'][np.random.choice(2,5)]` gives a TypeError. How do I select the strings by the random numbers? I tried `.tolist()` as well, still TypeError. – Matt Dowle Sep 02 '14 at 16:05
  • `things = ['a', 'b', 'c']; [things[x] for x in np.random.choice(2,5)]` – doctorlove Sep 02 '14 at 16:25
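
As a sketch of the indexing idea from the comments above: a plain Python list cannot be indexed by an array of integers, which is where the TypeError comes from, but a NumPy array can. The names `things`, `idx`, and `picked` are illustrative only. Note that `np.random.choice(2, 5)` draws only from 0 and 1, so the sketch uses `len(things)` to cover every element.

```python
import numpy as np

# Convert the list to an array so it accepts an array of integer indices
things = np.array(['a', 'b', 'c'])

# Draw 5 indices with replacement from 0 .. len(things) - 1
idx = np.random.randint(0, len(things), size=5)

# Fancy indexing selects one string per index in a single step
picked = things[idx]
```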

1 Answer

You can work round this using a generator function:

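def item():
    # Yield one random id at a time instead of building all N strings up front
    for i in range(N):  # range rather than xrange, since the question uses Python 3
        yield "id%010d" % np.random.choice(N // K)  # no size argument, so a scalar is returned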

This avoids needing all the items in memory at once.
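
A possible variation on the same idea, offered as a sketch rather than a tested fix for the original machine: draw the integer indices in chunks with one vectorised call each, and format the strings only as they are consumed. The helper name `sample_ids` and the default chunk size are assumptions made for illustration.

```python
import numpy as np

def sample_ids(n_items, n_unique, chunk=100000):
    """Yield n_items strings 'id%010d', drawn with replacement from n_unique ids."""
    remaining = n_items
    while remaining > 0:
        size = min(chunk, remaining)
        # One vectorised draw per chunk instead of one NumPy call per item
        for i in np.random.randint(0, n_unique, size=size):
            yield "id%010d" % i
        remaining -= size

# Small sizes here so the example runs quickly
ids = list(sample_ids(1000, 10, chunk=300))
```

Only one chunk of integers is held by the generator at a time; how much memory the consumer needs depends on what it does with the strings.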

doctorlove
  • Thanks. Have been trying this out. Can I pass a generator to `pandas.DataFrame()`? I'm testing that and not sure it's working. – Matt Dowle Sep 02 '14 at 17:30
  • I mean the `groupby` on that column returns a generator too. It feels like Pandas did a single grouping rather than reach inside the generator, if that makes sense. So I'm wondering if I need to eval the generator before passing it to Pandas? – Matt Dowle Sep 02 '14 at 17:39
  • It seems to depend which version of pandas: http://stackoverflow.com/questions/18915941/create-a-pandas-dataframe-from-generator or http://stackoverflow.com/questions/19605537/how-to-create-lazy-evaluated-dataframe-columns-in-pandas – doctorlove Sep 02 '14 at 17:59
  • Thanks again, interesting. Yes I was on version from Ubuntu stable. Have now installed from NeuroDebian and using the latest pandas: v0.14.1. Retesting ... – Matt Dowle Sep 02 '14 at 18:43
  • Passing the generator into pandas v0.14.1 gives `TypeError: object of type 'generator' has no len()`. That's fine. I didn't expect to be able to. Just need a way to get the generator to generate then? – Matt Dowle Sep 02 '14 at 18:47
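
On the last comment: a generator produces values only when something iterates over it, so wrapping it in `list()`, or pulling fixed-size chunks with `itertools.islice`, forces it to generate. A minimal sketch using only the standard library; this `item` takes hypothetical parameters `n_items` and `n_unique` and stands in for the generator in the answer, with small sizes so it runs quickly.

```python
import itertools
import random

def item(n_items, n_unique):
    """Stand-in generator: yields n_items ids drawn with replacement."""
    for _ in range(n_items):
        yield "id%010d" % random.randrange(n_unique)

# Materialise everything at once: the list has a len(), unlike the generator
ids = list(item(1000, 10))

# Or consume the generator in bounded chunks of at most 300 items
gen = item(1000, 10)
chunks = []
while True:
    chunk = list(itertools.islice(gen, 300))
    if not chunk:
        break
    chunks.append(chunk)
```

The resulting list can then be passed on, for example as `pandas.DataFrame({'id': ids})`.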