
I have written a Python script that reads the contents of two files: the first is a relatively small file (~30KB) and the second is a larger file (~270MB). The contents of both files are loaded into dictionary data structures. When the second file is loaded I would have expected the amount of RAM required to be roughly equivalent to the size of the file on disk, perhaps with some overhead, but watching the RAM usage on my PC it seems to consistently take ~2GB (around 8 times the size of the file). The relevant source code is below (pauses inserted just so I can see the RAM usage at each stage). The line consuming large amounts of memory is "tweets = map(json.loads, tweet_file)":

import sys
import json

scores = {}  # term -> score

def get_scores(term_file):
    global scores
    for line in term_file:
        term, score  = line.split("\t") #tab character
        scores[term] = int(score)

def pause():
    tmp = raw_input('press any key to continue: ')

def main():
    # get terms and their scores..
    print 'open word list file ...'
    term_file = open(sys.argv[1])
    pause()
    print 'create dictionary from word list file ...'
    get_scores(term_file)
    pause()
    print 'close word list file ...'
    term_file.close()
    pause()

    # get tweets from file...
    print 'open tweets file ...'
    tweet_file = open(sys.argv[2])
    pause()
    print 'create list of dictionaries from tweets file ...'
    tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)
    pause()
    print 'close tweets file ...'
    tweet_file.close()
    pause()

Does anyone know why this is? My concern is that I would like to extend my research to larger files, but will quickly run out of memory. Interestingly, the memory usage does not seem to increase noticeably after opening the file (as I think this just creates a handle to the file rather than reading its contents).

I have an idea to try looping through the file one line at a time, processing what I can and storing only the minimum that I need for future reference, rather than loading everything into a list of dictionaries. But I was just interested to see whether the roughly 8x multiplier of file size to memory when creating a dictionary is in line with other people's experience.
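For reference, a minimal sketch of that streaming approach, assuming each line of the tweets file is one standalone JSON object and that it carries a 'text' field (the field name is an assumption used purely for illustration):

import json

def process_tweets(path, scores):
    # Parse one tweet per line and keep only the derived score,
    # so at most one parsed tweet is held in memory at a time.
    tweet_scores = []
    with open(path) as tweet_file:
        for line in tweet_file:
            tweet = json.loads(line)
            text = tweet.get('text', '')  # hypothetical field name
            tweet_scores.append(sum(scores.get(w, 0) for w in text.split()))
    return tweet_scores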

ChrisProsser

3 Answers


My guess is that you have multiple copies of your data simultaneously stored in memory (in various formats). As an example, the line:

tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)

will create a new copy (+400~1000MB, including the dictionary overhead). But your original tweet_file stays in memory. Why such big numbers? Well, if you work with Unicode strings, each Unicode character uses 2 or 4 bytes in memory, whereas in your file, assuming UTF-8 encoding, most of the characters use only 1 byte. If you are working with plain strings in Python 2, the size of the string in memory should be almost the same as the size on disk, so you would have to find another explanation.

EDIT: The actual number of bytes occupied by a "character" in Python 2 may vary. Here are some examples:

>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof("a")
41
>>> sys.getsizeof("ab")
42

As you can see, it appears that each character is encoded as one byte. But:

>>> sys.getsizeof("à")
42

Not for "French" characters. And ...

>>> sys.getsizeof("世")
43
>>> sys.getsizeof("世界")
46

For Japanese, we have 3 bytes per character.

The above results are system dependent -- and are explained by the fact that my system uses UTF-8 as its default encoding. The "size of the string" calculated just above is in fact the size of the byte string representing the given text.

If json.loads uses "unicode" strings, the results are somewhat different:

>>> sys.getsizeof(u"")
52
>>> sys.getsizeof(u"a")
56
>>> sys.getsizeof(u"ab")
60
>>> sys.getsizeof(u"世")
56
>>> sys.getsizeof(u"世界")
60

In that case, as you can see, each extra character adds 4 extra bytes.


Maybe the file object caches some data? If you want to trigger explicit deallocation of an object, try setting its reference to None:

tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)
[...]
tweet_file.close()
tweet_file = None

When there is no longer any reference to an object, Python will deallocate it -- and so free the corresponding memory (within the Python heap -- I don't think the memory is returned to the system).
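As a hedged sketch continuing the snippet above (nothing here goes beyond what the question's own code already uses):

import gc
import json
import sys

tweet_file = open(sys.argv[2])
tweets = map(json.loads, tweet_file)  # list of dicts, one per tweet
# ... work with tweets here ...
tweet_file.close()
tweet_file = None  # drop the file object reference
tweets = None      # drop the big list of dictionaries
gc.collect()       # optional: force an immediate collection pass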

Sylvain Leroux
  • I have tried this, but it does not affect the amount of memory used. As mentioned in the question I think that the tweet_file object is actually just a pointer to the file and does not store the data itself. – ChrisProsser Jun 26 '13 at 07:11
  • Sorry, my mistake: I was thinking you read the entire file content into "tweet_file". The portion concerning "Unicode" in the answer is still relevant though. – Sylvain Leroux Jun 26 '13 at 07:13
  • BTW, when you speak about RAM usage, is this really "RAM" or virtual memory? Depending on your OS, reading a file could just be mapping some virtual memory pages to the actual file on disk. So VM usage increases, but there is no "real" increase of memory consumption. Still, depending on your OS, for I/O operations the system will use as much RAM as available to cache data. Don't be too obsessed by the numbers; reality could be more complex. – Sylvain Leroux Jun 26 '13 at 07:18
  • Thanks, perhaps the point about unicode characters could go some way towards explaining this. I will leave it open for now in case there are any further answers and accept this if not. On the RAM / Virtual memory I believe it is using physical RAM rather than virtual memory. Obviously, this is a good thing for fast access, but bad for running out of it quickly. – ChrisProsser Jun 26 '13 at 08:40
  • I don't buy the unicode explanation. The code is clearly Python 2 (`print` statements), and the `open` call doesn't specify an encoding, so the strings read from the file would be byte strings, not unicode strings. – Jun 26 '13 at 09:39
  • @delnan I made an edit to clarify (a little bit) that question of the "size of characters" in Python 2. – Sylvain Leroux Jun 26 '13 at 10:07
  • @SylvainLeroux FWIW, the size of a Unicode char is determined by a [compile-time option](http://stackoverflow.com/questions/1446347/how-to-find-out-if-python-is-compiled-with-ucs-2-or-ucs-4), and `json.loads` always seems to load strings as Unicode. Your `sys.getsizeof("à")` is a little misleading, since it depends on the character encoding of your console. If you were using Latin-1, it would only be one byte, but since you're using UTF-8, you're actually entering two 'characters' in the string, although you can only see one. – Aya Jun 26 '13 at 15:07

I wrote a quick test script to confirm your results...

import sys
import os
import json
import resource

def get_rss():
    # ru_maxrss is reported in kilobytes on Linux, hence the * 1024
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

def getsizeof_r(obj):
    # Recursively sum sys.getsizeof over nested lists/dicts
    # (counts the leaf objects, not the containers themselves).
    total = 0
    if isinstance(obj, list):
        for i in obj:
            total += getsizeof_r(i)
    elif isinstance(obj, dict):
        for k, v in obj.iteritems():
            total += getsizeof_r(k) + getsizeof_r(v)
    else:
        total += sys.getsizeof(obj)
    return total

def main():
    start_rss = get_rss()
    filename = 'foo'
    f = open(filename, 'r')
    l = map(json.loads, f)
    f.close()
    end_rss = get_rss()

    print 'File size is: %d' % os.path.getsize(filename)
    print 'Data size is: %d' % getsizeof_r(l)
    print 'RSS delta is: %d' % (end_rss - start_rss)

if __name__ == '__main__':
    main()

...which prints...

File size is: 1060864
Data size is: 4313088
RSS delta is: 4722688

...so I'm only getting a four-fold increase, which would be accounted for by the fact that each Unicode char takes up four bytes of RAM.

Perhaps you could test your input file with this script, since I can't explain why you get an eight-fold increase with your script.
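If you want to check whether your interpreter is a "narrow" (UCS-2) or "wide" (UCS-4) build -- i.e. whether each Unicode character costs 2 or 4 bytes in RAM -- a quick Python 2 check along these lines should work:

import sys

# sys.maxunicode is 65535 on a narrow (UCS-2) build
# and 1114111 on a wide (UCS-4) build.
if sys.maxunicode > 0xFFFF:
    print 'wide build: 4 bytes per unicode character'
else:
    print 'narrow build: 2 bytes per unicode character'

# The marginal cost per character can also be measured directly:
print sys.getsizeof(u'ab') - sys.getsizeof(u'a')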

Aya
  • Thanks, I will try this later when I have access to the file again. – ChrisProsser Jun 26 '13 at 09:14
  • Hi, I am having a little difficulty locating the Resource library for Windows, any ideas? – ChrisProsser Jun 26 '13 at 20:39
  • @ChrisProsser Oh. The `resource` module only exists on Linux. You'll have to remove the `get_rss()` function. The `getsizeof` stuff should still work tho'. – Aya Jun 27 '13 at 11:14
  • Sorry for the delay responding, my results are File size is: 277551773 and Data size is: 1070424362 which is also about a four fold increase. I guess it is a mystery where the other GB is going. – ChrisProsser Jul 03 '13 at 19:46

Have you considered the memory usage for the keys? If you have lots of small values in your dictionary, the storage for the keys could dominate.
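A rough way to check this, sketched here with sys.getsizeof on a toy dictionary (it counts the key and value objects themselves, not the dict's internal hash table):

import sys

scores = {u'happy': 3, u'sad': -2, u'ok': 0}  # toy example

key_bytes = sum(sys.getsizeof(k) for k in scores)
value_bytes = sum(sys.getsizeof(v) for v in scores.itervalues())

print 'keys:   %d bytes' % key_bytes
print 'values: %d bytes' % value_bytes
# For short unicode keys and small ints, the keys usually dominate.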

Stefan
  • I had expected a small overhead for this, similar to the storage overhead that you would get for an index on a database, but had not expected this to be more than ~20% of the initial file size. – ChrisProsser Jun 26 '13 at 12:08
  • It depends on the length of your lines. If the keys are ints, that's 8 bytes per line on a 64-bit system. If your lines are long, that might not matter, but if your document has lots of short or empty lines, you might find the keys are much bigger than the text. – Stefan Jun 26 '13 at 12:15