4

I'm a little puzzled about how Python allocates memory and garbage-collects, and how platform-specific that is. For example, compare the following two code snippets:

Snippet A:

>>> id('x' * 10000000) == id('x' * 10000000)
True

Snippet B:

>>> x = "x"*10000000
>>> y = "x"*10000000
>>> id(x) == id(y)
False

Snippet A returns True because Python allocates both strings at the same memory location in the first test, whereas in the second test it allocates them at different locations, so their ids differ.

But apparently system performance or platform impacts this, because when I try this on a larger scale:

for i in xrange(1, 1000000000):
    if id('x' * i) != id('x' * i):
        print i
        break

A friend on a Mac tried this, and it ran until the end. When I ran it on a bunch of Linux VMs, it invariably broke out of the loop, but at a different point on each VM. Is this because of the scheduling of garbage collection in Python? Was it because my Linux VMs had less processing speed than the Mac, or does the Linux Python implementation garbage-collect differently?

pbanka
  • Here is another helpful discussion: [why-is-keyword-has-different-behavior-when-there-is-dot-in-the-string](http://stackoverflow.com/questions/2858603/why-is-keyword-has-different-behavior-when-there-is-dot-in-the-string) – pbanka Nov 01 '12 at 16:22

3 Answers

6

The garbage collector just uses whatever space is convenient. There are lots of different garbage collection strategies, and the outcome is also affected by parameters, different platforms, memory usage, phase of the moon, etc. Trying to guess how the interpreter will happen to allocate particular objects is just a waste of time.

Antimony
5

This happens because Python caches small integers and strings:

Large strings stored in variables are not cached:

In [32]: x = "x"*10000000

In [33]: y = "x"*10000000

In [34]: x is y
Out[34]: False

Large strings not stored in variables look like they are cached:

In [35]: id('x' * 10000000) == id('x' * 10000000)
Out[35]: True

Small strings are cached:

In [36]: x="abcd"

In [37]: y="abcd"

In [38]: x is y
Out[38]: True

Small integers are cached:

In [39]: x=3

In [40]: y=3

In [41]: x is y
Out[41]: True

Large integers stored in variables are not cached:

In [49]: x=12345678

In [50]: y=12345678

In [51]: x is y
Out[51]: False

Large integers not stored in variables look like they are cached:

In [52]: id(12345678)==id(12345678)
Out[52]: True
Ashwini Chaudhary
  • That doesn't explain the behavior the question is asking about. – Antimony Oct 30 '12 at 23:18
  • @Antimony the behavior he's observing is the result of caching done by Python. So I tried to explain that a bit. – Ashwini Chaudhary Oct 30 '12 at 23:24
  • No, it's not. Interning explains why it doesn't just return 1 every time. But the question is why it returns varying values on different platforms and sometimes doesn't return at all. – Antimony Oct 30 '12 at 23:58
3

CPython uses two strategies for memory management:

  1. Reference Counting
  2. Cyclic garbage collection (a separate collector that detects reference cycles)

Allocation is, in general, done via the platform's malloc/free functions and inherits the performance characteristics of the underlying runtime. Whether freed memory is reused is decided by the platform's allocator and the operating system. (Some objects are pooled by the Python VM.)

However, your example does not trigger the 'real' GC algorithm (this is only used to collect cycles). Your long string gets deallocated as soon as the last reference is dropped.
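
A minimal sketch of that last point, assuming CPython: with the cycle collector switched off, an object still disappears the moment its last reference is dropped. `Payload` and `freed_without_gc` are hypothetical names used only for this illustration; a wrapper class is needed because plain `str` objects cannot be weakly referenced.

```python
import gc
import weakref


class Payload:
    """Hypothetical wrapper; a plain str cannot be weak-referenced."""

    def __init__(self, size):
        self.data = 'x' * size


def freed_without_gc(size=1000):
    """Return True if the object is gone right after its last reference is dropped."""
    was_enabled = gc.isenabled()
    gc.disable()  # rule out the cycle collector
    try:
        obj = Payload(size)
        ref = weakref.ref(obj)  # lets us observe the object without keeping it alive
        del obj  # drop the only strong reference
        return ref() is None
    finally:
        if was_enabled:
            gc.enable()


print(freed_without_gc())  # True on CPython; other implementations may differ
```

Since reference counting alone frees the object here, the timing of the cycle collector plays no part in whether the freed address is handed out again.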

ebo
  • *CPython* does that. While knowing that is useful for hacks, for explaining the results of individual experiments, etc., it's ultimately as much of an implementation detail as, say, the caching of integers. –  Oct 30 '12 at 20:13
  • True, fixed. I still think the original question was targeted at CPython's behaviour. – ebo Oct 30 '12 at 20:31