I have an ASCII file that is essentially a grid of 16-bit signed integers; it is approximately 300MB on disk. I do not need to read the whole file into memory at once, but I do need to store its contents as a single container (of containers). For initial testing of memory use, I tried lists and tuples as inner containers, with the outer container always a list built via list comprehension:

with open(file, 'r') as f:
    for _ in range(6):
        t = next(f) # skipping some header lines
    # Method 1
    grid = [line.strip().split() for line in f] # produces a 3.3GB container
    # Method 2 (on another run)
    grid = [tuple(line.strip().split()) for line in f] # produces a 3.7GB container

After discussing use of the grid with the team, I need to keep it as a list of lists up to a certain point, after which I will convert it to a list of tuples for program execution.

What I am curious about is how a 300MB file can have its lines stored in a container of containers and have its overall size be 10x the original raw file size. Does each container really occupy that much memory space for holding a single line each?

pstatix
  • show the code for how you calculate the size of `grid` – Chris_Rands Jan 10 '18 at 14:13
  • It's strange that those two would give different sizes, because they are the exact same code. You have to call `tuple(foo_list)` to make it a tuple. – Aaron Klein Jan 10 '18 at 14:15
  • `[(line.strip().split()) for line in f]` is the __exact__ same thing as `[line.strip().split() for line in f]`, so I wonder how you calculate `grid` size to get different sizes. And actually, if you really have both statements in that order, the second will produce an empty list since `f` has been exhausted by the first one. – bruno desthuilliers Jan 10 '18 at 14:16
  • Maybe [this helps](https://stackoverflow.com/questions/519633/lazy-method-for-reading-big-file-in-python#519653). – rocksteady Jan 10 '18 at 14:17
  • In the file, each character costs you one byte. Every python string object has an additional overhead of around 50 bytes: `sys.getsizeof('') -> 49`, `sys.getsizeof('a') -> 50`. That's 50 extra bytes for every cell in your grid. If your average token has a length of `5`, there's your factor `10`. – user2390182 Jan 10 '18 at 14:20
  • @Chris_Rands Thanks, I just thought a second time and came to the same conclusion ;) – rocksteady Jan 10 '18 at 14:22
  • @Chris_Rands No code for that, just native OS support. No other applications running; I run a python program and monitor the process in Task Manager (goes from 25K to 3.7 million K). – pstatix Jan 10 '18 at 14:24
  • @AaronKlein Made a typo (copy-paste was not available). I do call `tuple(foo_list)` to build a list of tuples, I don't just surround in `()`. – pstatix Jan 10 '18 at 14:25
  • @rocksteady Doesn't solve the problem. I need to store the content, and reading is not an issue since I never read the file into memory plus build a container from it. – pstatix Jan 10 '18 at 14:26
  • @brunodesthuilliers Made a typo, it isn't just `(foo_list)`, it's `tuple(foo_list)`. Also, I don't execute them as shown for the reason you stated, just showing which methods I had applied to test. – pstatix Jan 10 '18 at 14:26
  • @schwobaseggl Each cell in the grid ranges from 0 to -9999 (lots of the -9999), so that means the `'-'` gets 50 bytes and each `'9'` gets 50 bytes. Why does python give so many bytes for a string?? – pstatix Jan 10 '18 at 14:32
  • @schwobaseggl A quick brain-lapse reminded me these are 16-bit ints so I changed the grid to `[list(map(int, line.strip().split())) for line in f]` and reduced memory footprint by 1.2GB. – pstatix Jan 10 '18 at 14:33
  • @pstatix No, every token has `1` byte for each character + `~50` bytes overhead for the str object (can be different depending on your system and python version): so `'-9999'` needs `49 + 5 = 54` bytes. Casting to int certainly saves some memory. An `int` needs ~28 bytes – user2390182 Jan 10 '18 at 14:34
  • @schwobaseggl Okay, tracking now; still, why the additional ~50? If the string could occupy 5 bytes, why not leave it? – pstatix Jan 10 '18 at 14:36
  • @pstatix these are the inner workings of the C implementation. See e.g. [this blog](https://www.laurentluce.com/posts/python-string-objects-implementation/) – user2390182 Jan 10 '18 at 14:38
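
The per-object overhead described in the comments above can be checked directly with `sys.getsizeof`. The following is a minimal sketch, assuming CPython; the sample tokens are made up for illustration, and the exact byte counts vary with platform and Python version, so none are hard-coded here.

```python
import sys

# Hypothetical sample of tokens, similar to one short line of the grid file.
tokens = "0 -9999 -9999 123 -9999".split()

# Raw size on disk: one byte per character plus one separator per token.
bytes_on_disk = sum(len(t) + 1 for t in tokens)

# Size of the str objects alone (the list that holds them costs extra).
bytes_as_str = sum(sys.getsizeof(t) for t in tokens)

# Size of the equivalent int objects.
bytes_as_int = sum(sys.getsizeof(int(t)) for t in tokens)

print('on disk:', bytes_on_disk)
print('as str :', bytes_as_str)
print('as int :', bytes_as_int)
```

On a typical 64-bit CPython build the `str` total comes out roughly ten times the on-disk total, which lines up with the factor observed in the question.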

1 Answer

If you are concerned about storing data in memory and do not want to use tools outside of the standard library, you might want to take a look at the array module. It is designed to store numbers very efficiently in memory, and the array.array class accepts various type codes based on the characteristics of the numbers you want stored. The following is a simple demonstration of how you might want to adapt the module for your use:

#! /usr/bin/env python3
import array
import io
import pprint
import sys

CONTENT = '''\
Header 1
Header 2
Header 3
Header 4
Header 5
Header 6
 0 1 2 3 4 -5 -6 -7 -8 -9 
 -9 -8 -7 -6 -5 4 3 2 1 0 '''


def main():
    with io.StringIO(CONTENT) as file:
        for _ in range(6):
            next(file)
        grid = tuple(array.array('h', map(int, line.split())) for line in file)
    print('Grid takes up', get_size_of_grid(grid), 'bytes of memory.')
    pprint.pprint(grid)


def get_size_of_grid(grid):
    return sys.getsizeof(grid) + sum(map(sys.getsizeof, grid))


if __name__ == '__main__':
    main()
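
To put the saving in perspective, here is a rough comparison of one row stored three ways: as a list of strings, as a list of ints, and as an `array.array('h')`. This is only a sketch under assumptions: the 1000-cell row of `-9999` values and the helper `deep_size` are made up for illustration, and the numbers depend on the CPython build.

```python
import array
import sys

# Hypothetical row: 1000 cells, all holding the value -9999.
row_text = " ".join(["-9999"] * 1000)

as_str_list = row_text.split()
as_int_list = [int(token) for token in as_str_list]
as_array = array.array('h', as_int_list)


def deep_size(seq):
    """Size of a list plus the sizes of the objects it references."""
    return sys.getsizeof(seq) + sum(sys.getsizeof(item) for item in seq)


# An array stores raw 2-byte values in its own buffer, so
# sys.getsizeof already accounts for its contents.
print('list of str:', deep_size(as_str_list), 'bytes')
print('list of int:', deep_size(as_int_list), 'bytes')
print('array("h") :', sys.getsizeof(as_array), 'bytes')
```

The list variants pay for a pointer per cell plus a full Python object per cell, while the array pays only `itemsize` bytes per cell plus a small fixed header.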
Noctis Skytower
  • How have I never heard of the `array` lib? Does an `array.array` support indexing and assignment? The grid is 2D, so I may need to access `grid[13][25]` for a value. – pstatix Jan 10 '18 at 14:35
  • @pstatix Yes, you can get items out of an array and set items in an array via indexing. – Noctis Skytower Jan 10 '18 at 14:39
  • Why call `tuple()` on the array instantiation? Do you have to convert it to a `tuple` or `list`? – pstatix Jan 10 '18 at 14:40
  • @pstatix If you are concerned about memory usage, compare the results from the following two statements: (1) `sys.getsizeof(tuple(None for _ in range(10 ** 6)))` and (2) `sys.getsizeof(list(None for _ in range(10 ** 6)))` For each million rows, 244 KB extra is used for a list. If `array` instances do not need to be replaced in the grid, you are better off with using a `tuple`. – Noctis Skytower Jan 10 '18 at 14:46
  • Just ran this solution, 3.3GB down to 194MB. How, how, how have I never heard about `array`?? Documentation is seriously limited though, only discussing member functions of the array class, only a single line mentioning ['Arrays are sequence types and behave very much like lists...'](https://docs.python.org/3.6/library/array.html#module-array). The slice operator even works wonderfully! – pstatix Jan 10 '18 at 14:52
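
The indexing, assignment, and slicing behaviour discussed in the comments above can be tried out on a small grid. This is a minimal sketch; the 2 x 4 sample data and the variable names are hypothetical and only stand in for the real file contents.

```python
import array

# Hypothetical 2 x 4 grid standing in for the real data.
rows = ([0, 1, 2, 3], [-9999, -8, -7, -6])
grid = tuple(array.array('h', row) for row in rows)

value = grid[1][0]        # read one cell
grid[0][2] = 42           # assign in place; the rows stay mutable
window = grid[1][1:3]     # slicing returns a new array.array

# Type code 'h' is a signed short, so out-of-range values are rejected.
try:
    grid[0][0] = 40000
    overflowed = False
except OverflowError:
    overflowed = True

# The outer tuple itself cannot have a row replaced.
try:
    grid[0] = array.array('h', [9, 9, 9, 9])
    row_replaced = True
except TypeError:
    row_replaced = False

print(value, grid[0].tolist(), window.tolist(), overflowed, row_replaced)
```

Using a tuple as the outer container fixes the set of rows while leaving individual cells editable; if whole rows ever need to be swapped out, a list as the outer container would be the alternative.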