
How does one go about (aside from trial and error) determining the amount of RAM required to store one's dataset?

I know this is a super general question, so hopefully this example can narrow down what I am trying to understand:

I have a data file that contains characters [A-Z] and numbers (no special symbols). I want to read the data into RAM (using Python) and then store it in a dictionary. I have a lot of data and a computer with only 2 GB of RAM, so I'd like to know ahead of time whether the data would fit into RAM, as this could change the way I load the file with Python and handle the data downstream. I recognize that the data may not all fit into RAM, but that's another problem; here I just want to know how much RAM the data would take up and what I need to consider to make this determination.

So, knowing the content of my file, its initial size, and the downstream data structure I want to use, how can I figure out the amount of RAM the data will take up?
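
To make this concrete, here is a rough sketch of the kind of loading I have in mind (the file name, delimiter, and field names are just placeholders, not my actual format):

```python
# Hypothetical example: read a text file of [A-Z] characters and numbers into
# a dict, one record per line ("key<TAB>value" is assumed here for illustration).
records = {}
with open("data.txt") as f:          # "data.txt" is a placeholder name
    for line in f:
        key, _, value = line.rstrip("\n").partition("\t")
        records[key] = value
```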

user4673
  • Will it be a single dictionary or multiple dictionaries? I think it shouldn't be far from the file size. – aIKid Jan 04 '14 at 00:30
  • if you read the whole file into RAM, you need at least N bytes, where N is the size of the file. Then, when you build the dictionary, you'd need to know how Python implements dictionaries to estimate their memory usage as a function of N, if it's known. It's an estimate of course, and likely rather broad. – ShinTakezou Jan 04 '14 at 00:33
  • @ShinTakezou does the nature of the data stored matter? For example, if I were to store 32-bit ints, or 8-bit ints for my numeric data? I know Python's typing is dynamic, but if I could specify it, as one would in other languages more routinely, how much of a difference would that make? – user4673 Jan 04 '14 at 00:38
  • 2
    Check out http://stackoverflow.com/questions/110259/which-python-memory-profiler-is-recommended - more specifically, heapy – NG. Jan 04 '14 at 00:43
  • 1
    @user4673 the data type does matter, and can save you space as you point out (for instance a factor of 4 between 32bit and 8bit integers. You might want to have a look at using a numpy array if part of your dataset is purely numeric (obviously you have some strings, but perhaps not every entry is a string?) numpy will likely be faster when you come to access/analyse the data too. Perhaps you could provide a sample or your dataset in the question? – three_pineapples Jan 04 '14 at 00:47
  • The data sets I typically work with are FASTQ files (http://en.wikipedia.org/wiki/FASTQ_format). I know there are optimized methods for such data as well. I typically store each read (with its ID, sequence, and quality) as a class instance and build a dictionary of those reads. Eventually, that will run out of memory when I analyze something large enough. – user4673 Jan 04 '14 at 00:50
  • 1
    @user4673: Python doesn't have 8-bit ints or 32-bit ints; it has arbitary-length ints that take up anywhere from 16 bits to 128Gbits depending on how big they are. More importantly, every Python object, including an integer, is "boxed", with a header that's dozens of bytes long. On the other hand, Python can also collapse immutable built-in types, so if you have 1000000 copies of the number `1`, you probably only _really_ have 1 copy of the number `1` (and 1000000 references to it, but a reference is only the same of a pointer). – abarnert Jan 04 '14 at 00:51
  • Anyway, if you need to mix numerical data and string data, take a look at Pandas, which sits on top of NumPy (the library @three_pineapples recommended). – abarnert Jan 04 '14 at 00:52
  • Finally, do you really need all of the data in RAM, or do you just need it to act like an in-RAM `dict`? Because you can get the latter with, e.g., the `dbm` or `shelve` modules without using up all of your memory (a minimal `shelve` example is sketched just after these comments). – abarnert Jan 04 '14 at 00:53
  • "I store the data in a dictionary". That's not really enough detail. If you mean that you read the whole file into a `bytes` object, then create a dictionary `{ 'data' : my_bytes_object }`, then it takes an amount of RAM equal to the file size, plus a negligible amount for the dictionary. If you mean something different from that, then the amount of RAM used depends what you mean. Because of per-object overheads, it matters how many objects you chop your data into (and of course the types of those objects). The overheads vary between versions of Python and 32 vs 64 bit... – Steve Jessop Jan 04 '14 at 00:53
  • @abarnert If things get too large, I will ideally need it to act like an in-RAM dict. Thank you for the module recommendations. – user4673 Jan 04 '14 at 00:55
  • ... for example in Python 3.2 your string data will be double the size if stored in `str` compared with storing it in `bytes`. Python 3.3 introduces variably-encoded `str` objects, so provided your data is all `A-Z` that difference disappears. – Steve Jessop Jan 04 '14 at 00:58
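
Following up on the `shelve` suggestion above, a minimal sketch of the disk-backed-dict idea (the filename and key/value contents are made up for illustration):

```python
import shelve

# A shelf looks and acts like a dict, but its values live on disk, so only the
# entries you actually touch get loaded into memory.
db = shelve.open("reads_db")                  # "reads_db" is a made-up filename
db["read_0001"] = ("ACGTACGT", "IIIIHHHH")    # e.g. (sequence, quality) -- made up
print(db["read_0001"])
db.close()
```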

1 Answer


The best thing to do here is not to try to guess, or to read the source code and write up a rigorous proof, but to do some tests. There are a lot of complexities that make these things hard to predict. For example, if you have 100K copies of the same string, will Python store 100K copies of the actual string data, or just 1? It depends on your Python interpreter and version, and all kinds of other things.
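
For example, here's a quick sketch of that string-sharing question (the exact numbers vary by interpreter, version, and platform, so treat it as an illustration rather than a specification):

```python
import sys

base = "ACGT"
record = base * 25                                  # one 100-character string

shared = [record] * 100000                          # 100,000 references to ONE object
distinct = [base * 25 for _ in range(100000)]       # 100,000 equal but separate objects

# Every object carries a header, so even a small int or a 1-character string
# costs a few dozen bytes on a typical 64-bit build:
print(sys.getsizeof(1), sys.getsizeof("A"))

# Both lists hold 100,000 pointers, so the lists themselves are about the same
# size -- but one points at a single string object and the other at 100,000:
print(sys.getsizeof(shared), sys.getsizeof(distinct))
print(len({id(x) for x in shared}), len({id(x) for x in distinct}))
```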

The documentation for `sys.getsizeof` has a link to a recursive sizeof recipe. And that's exactly what you need to measure how much storage your data structure is using.
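
A stripped-down version of that idea looks roughly like this (the linked recipe handles more types and edge cases; this is just a sketch):

```python
import sys

def total_size(obj, seen=None):
    """Roughly estimate the memory footprint of obj plus everything it references.

    A simplified take on the recipe linked from the sys.getsizeof docs: it walks
    dicts, lists, tuples and sets, plus instance __dict__s, and uses an id-based
    'seen' set so shared objects are only counted once.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:                      # already counted (shared object)
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    if hasattr(obj, "__dict__"):             # plain class instances
        size += total_size(vars(obj), seen)
    return size
```

Call `total_size(...)` on the dictionary you build and you get a rough byte count for the whole structure.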

So load in, say, the first 1% of your data and see how much memory it uses. Then load in 5% and make sure it's about 5x as big. If so, you can guess that your full data will be 20x as big again.
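
In code, the experiment might look something like this (`load_records` is a hypothetical loader standing in for however you actually build your dictionary, and `total_size` is the sketch above):

```python
import os

def load_records(path, fraction):
    """Hypothetical loader: build the dictionary from roughly the first
    `fraction` of the file (assumes one "key<TAB>value" record per line)."""
    limit = os.path.getsize(path) * fraction
    records, read = {}, 0
    with open(path) as f:
        for line in f:
            read += len(line)
            if read > limit:
                break
            key, _, value = line.rstrip("\n").partition("\t")
            records[key] = value
    return records

size_1 = total_size(load_records("data.txt", 0.01))   # first 1%
size_5 = total_size(load_records("data.txt", 0.05))   # first 5%
print(size_5 / size_1)                                # hope for roughly 5

# If the growth looks linear, the full file should land near 20x the 5% figure.
print("estimated full size:", size_5 * 20, "bytes")
```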

(Obviously this doesn't work for all conceivable data—there are some objects that have more cross-links the farther you get into the file, others—like numbers—that might just get larger, etc. But it will work for a lot of realistic kinds of data. And if you're really worried, you can always test the last 5% against the first 5% and see how they differ, right?)

You can also test at a higher level by using modules like Heapy, or completely externally by just watching with Task Manager/Activity Monitor/etc., to double-check the results. One thing to keep in mind is that many of these external measures will show you the peak memory usage of your program, not the current memory usage. And it's not even clear what you'd want to call "current memory usage" anyway. (Python rarely releases memory back to the OS. If it leaves memory unused, it will likely get paged out of physical memory by the OS, but the VM size won't come down. Does that count as in-use to you, or not?)
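
If you'd rather check from inside the program than watch an external monitor, one option is the standard library's `resource` module (Unix-only), which reports the peak resident set size; note the units differ by platform:

```python
import resource
import sys

# ru_maxrss is the peak resident set size of this process: kilobytes on Linux,
# bytes on macOS. The resource module is not available on Windows.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
scale = 1 if sys.platform == "darwin" else 1024
print("peak memory use: about %.1f MiB" % (peak * scale / (1024.0 * 1024.0)))
```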

abarnert