
I was wondering:

  1. Do floats, lists, tuples, or any other type of variable take up the same amount of space in memory while a script is running as they do when saved on disk?
  2. In what format or data type are floats held in memory while a script is running?
  3. Will the result of getsizeof on a list or tuple of floats match the size on disk if I save it? If so, in which format should I save it, assuming I don't need compression?
Martijn Pieters
Jones
  • none of these have any relevance ... if you need to store it on disk the size will depend on how you store it... but it will likely be larger than the original data structure (but it depends on how you encode it)... but if you need to store it on disk then you need to store it on disk... regardless of whether or not it takes more space – Joran Beasley Oct 22 '18 at 20:05
  • Hi Kevin, when posting you should avoid asking multiple questions in one post. Also, in-memory storage is not the same as on-disk storage. – d_kennetz Oct 22 '18 at 20:05
  • Thank you Joran. But if I have a list of size 256 bytes (as seen with getsizeof after running script), will it be 256 bytes on disk as well? If I save it, I mean – Jones Oct 22 '18 at 20:06
  • it depends on how you save it ... but probably not ... it will likely have some overhead in addition to the 256 bytes – Joran Beasley Oct 22 '18 at 20:09
  • Ok d_kennetz, I will – Jones Oct 22 '18 at 20:09
  • And what would you say is the most space-efficient way to store lists of floats? HDF5? – Jones Oct 22 '18 at 20:11
  • why is it important? disk space is basically free ... i can't believe you need to eke every last byte out of your hard disk ... but the most efficient is _**probably**_ using struct pack ... but it's going to be harder to use than something like hdf5 or pickle or json ... – Joran Beasley Oct 22 '18 at 20:12
  • In theory disk and memory are the same, they're all just bits that can be used to represent whatever. In practice there are a lot of differences. To get good answers it would help if you explained a bit more what it is you're trying to do/understand. – Bastiaan Oct 22 '18 at 20:16
  • Thank you guys. Well, in one of my scripts, I have created a list of 10^12 float numbers and getsizeof() reports 800MB, and I would like to know if I can store it on disk with the same size or at least not very far from the same size. – Jones Oct 22 '18 at 20:28
  • @Kevin: without information on *how* you are storing those numbers we can't actually help you figure out how much space that'll take. You can find out by storing, say, 100 floats and seeing how much space that takes. Then go to 1000 floats and compare with your notes, etc. – Martijn Pieters Oct 22 '18 at 20:31
  • @Kevin: and perhaps you need to store those floats in a specific format, or need to be able to share the file with another piece of software that can only support specific formats. And perhaps your floats have a lot of small numbers, or a lot of them are 0, or there are other patterns, and compression would really be able to reduce the size, etc. Perhaps storing them as 4-byte `float` is enough precision and you can get away with only needing < 4TB for storing 10**12 uncompressed float numbers. – Martijn Pieters Oct 22 '18 at 20:36
  • @Kevin: (And no, I don't believe you created a list with 10**12 float numbers in Python alone, that'd take 22TB of memory, a single float on a 64-bit OS easily takes 24 bytes, 800MB is about 35 million floats). – Martijn Pieters Oct 22 '18 at 20:36
  • Yes, I thought that I had probably done something wrong because it looked very unlikely to me too – Jones Oct 22 '18 at 20:40

1 Answer


No, in-memory structures and data written to disk will almost certainly have different sizes. That's because Python objects in memory track information that is not needed when persisting (a reference count, a type pointer, weak references if the type supports them, etc.), and on-disk storage aims at very different use cases.

For example, depending on the highest Unicode codepoint, Python strings use 1, 2 or 4 bytes per character in memory, because that is the best trade-off for making string operations efficient. But if you store that same text as UTF-8 encoded data on disk, the variable-width encoding means you will usually need less space for the same information.
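
A minimal sketch of that difference, assuming CPython 3.3 or later (where the flexible string representation applies). The variable names are illustrative only, and exact `sys.getsizeof()` values vary between versions:

```python
import sys

# 1000 ASCII letters plus one emoji. The emoji's codepoint is above U+FFFF,
# so CPython stores every character of this string using 4 bytes.
wide = "a" * 1000 + "\U0001F600"

mem_size = sys.getsizeof(wide)         # object header + 4 bytes per character
utf8_size = len(wide.encode("utf-8"))  # 1 byte per "a" + 4 bytes for the emoji

print(f"in memory: {mem_size} bytes, as UTF-8: {utf8_size} bytes")
```

The UTF-8 form is roughly a quarter of the in-memory size here, because only the one wide character pays for 4 bytes.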

You don't specify how you are saving floats to disk, but how much space is taken up depends entirely on the storage format chosen. Floats could be written as text (ASCII digits in a CSV or JSON file), as binary C struct data, as pickled data, or in some other format whose properties suit specific needs.

To estimate disk usage, focus on the data written to disk and research the format used. For example, floating point numbers stored as C doubles take up 8 bytes per value.
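
As a rough illustration of how much the format matters, here is a sketch that serializes the same 1000 floats three ways and compares the byte counts with an estimate of the in-memory size. The sample values and variable names are made up for the example, and it assumes CPython for the `sys.getsizeof()` figures:

```python
import json
import pickle
import struct
import sys

values = [i / 7 for i in range(1000)]  # arbitrary sample data

# In memory: the list object (an array of references) plus each float object.
in_memory = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Binary C doubles, little-endian: exactly 8 bytes per value.
packed = struct.pack(f"<{len(values)}d", *values)

# Text and pickle encodings of the same data.
as_json = json.dumps(values).encode("ascii")
as_pickle = pickle.dumps(values)

print("in memory (approx.):", in_memory)
print("struct doubles:     ", len(packed))
print("JSON text:          ", len(as_json))
print("pickle:             ", len(as_pickle))

# The packed bytes can be read back without loss.
restored = list(struct.unpack(f"<{len(packed) // 8}d", packed))
```

Writing `packed` to a file opened in binary mode gives a file of exactly `8 * len(values)` bytes; the other formats trade size for convenience or portability.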

Martijn Pieters
  • Thank you Martijn. What I don't get is that if you store a list in memory, you are indeed storing it, so independently of the ability to access it quicker than on disk, why would the size be different? Isn't there a way to store it the same way on disk as in memory? – Jones Oct 22 '18 at 20:20
  • @Kevin: it's just bits in one medium or another, so yes, you could store the exact same information on disk. But *it would be useless*, because what is stored won't necessarily translate to something you can use again later on. – Martijn Pieters Oct 22 '18 at 20:21
  • @Kevin: a list is an array of references, plus some metadata (like the reference count I mentioned). Those references are numbers that are addresses of other pieces of memory. That's not something that translates to disk, as a disk won't have the same addresses, and whatever is referenced by the list (floats, strings, other lists) would be stored differently too. – Martijn Pieters Oct 22 '18 at 20:22
  • @Kevin: floats are relatively simple. The Python object holds more information that you don't need though. You can use the `struct` module to store floats as `double` values, 8 bytes each. The in-memory object stores exactly that value, plus extra things you never need in another program when you read the info from disk again. So why store that too? – Martijn Pieters Oct 22 '18 at 20:24
  • @Kevin: put differently: when you copy a file from your harddisk to a USB drive to take to a friend, would you also want to copy the full path to the directory on your harddisk? Would you copy the owner, the creation date, the preview information your OS caches for the file to show in your graphical user interface? That's 'stuff' that's there for all sorts of needs that you wouldn't need to bring on the USB stick. Python and file storage are the same, different pieces of info you don't need to bring along. – Martijn Pieters Oct 22 '18 at 20:26
  • Thank you again for your explanation. Just one doubt: floats are supposed to be 8 bytes in size, but in one of my scripts I have created a list of 10^12 float numbers and getsizeof() reports 800MB. I didn't know it was even possible and I still don't know why, but I would like to know if I can store it on disk with the same size or at least not very far from the same size (even double the size would be acceptable). What format would you recommend? – Jones Oct 22 '18 at 20:36
  • @Kevin: you can't have created 10**12 floats. With 24 bytes for a single floating point number object in Python, you'd need 22TB of memory for the floats alone. – Martijn Pieters Oct 22 '18 at 20:39
  • @Kevin: `sys.getsizeof()` on a list only gives you the size of the list object, by the way, which is the size of a reference times the length, plus some overhead. So a list with 10 ** 8 elements and 8 bytes for a reference would take 800MB of memory, *without measuring the contents that have been referenced*. – Martijn Pieters Oct 22 '18 at 20:50
  • I have edited my question to add a screenshot of my CMD, your explanation makes sense as what I created was a list of 10 ** 8 lists with 10 ** 4 elements each and it looks like this is why getsizeof() gives me around 800MB but I don't understand why I can store this quantity of float numbers anyway (10 ** 12), I only have 1TB on my computer. – Jones Oct 23 '18 at 15:18
  • Sorry, I had to roll that back. Stack Overflow is not a forum and is not really suited for continued follow-up questions. – Martijn Pieters Oct 23 '18 at 15:24
  • you don't show how the lists are created, but I'm betting that `listest[0] is listest[1]` is true and that, in reality, you have 10 ** 8 references to a single list object, not 10 ** 8 separate lists. That makes 10 ** 8 + 10 ** 4 references and 10 ** 4 float numbers, which is still about 800MB. Not 10 ** 12 references and float numbers. – Martijn Pieters Oct 23 '18 at 15:26
  • @Kevin: see [List of lists changes reflected across sublists unexpectedly](//stackoverflow.com/q/240178) – Martijn Pieters Oct 23 '18 at 15:33
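
The two points made in the comments above (that `sys.getsizeof()` on a list counts only the references, and that multiplying a list repeats one reference) can be sketched at a much smaller scale. This is a guess at how the lists in question were built, since the original code is not shown; `shared`, `separate` and the sizes used are hypothetical:

```python
import struct
import sys

n_outer, n_inner = 1000, 100  # scaled down from 10 ** 8 and 10 ** 4

# Multiplying a list copies the reference, not the inner list itself.
inner = [0.0] * n_inner
shared = [inner] * n_outer

# A comprehension builds a new inner list on every iteration.
separate = [[0.0] * n_inner for _ in range(n_outer)]

print(shared[0] is shared[1])      # True: one inner list, referenced 1000 times
print(separate[0] is separate[1])  # False: 1000 distinct inner lists

# getsizeof() reports only the outer list: roughly one pointer per element,
# regardless of how large the referenced inner lists are.
pointer_size = struct.calcsize("P")
print(sys.getsizeof(shared), "bytes for", n_outer, "references of", pointer_size, "bytes")

# Number of distinct inner list objects actually held in memory.
distinct_shared = len({id(x) for x in shared})
distinct_separate = len({id(x) for x in separate})
```

With the `shared` construction the outer list looks like it holds `n_outer * n_inner` floats, but only one inner list exists, which is why such a structure can fit in memory at sizes where truly separate lists could not.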