How do I make Python make all identical strings use the same memory?

Question

Possible Duplicate:
What does python intern do, and when should it be used?

I am working on a Python program that has to correlate on an array with millions of string objects. I've discovered that if they all come from the same quoted string, each additional "string" is just a reference to the first, master string. However, if the strings are read from a file, each one still requires a new memory allocation, even if they are all equal.

That is, this takes about 14 MB of storage:

a = ["foo" for a in range(0,1000000)]

While this requires more than 65 MB of storage:

a = ["foo".replace("o","1") for a in range(0,1000000)]

Now I can reduce the memory use considerably with this:

s = {"f11":"f11"}
a = [s["foo".replace("o","1")] for a in range(0,1000000)]

But that seems silly. Is there an easier way to do this?

@Maulwurfn, just because the answer is the same doesn't mean the question is the same. — Mark Ransom, Aug 05 '12 at 17:16
How are you measuring the size of the lists? If I use `sys.getsizeof(["foo" for a in range(0,1000000)])` I get the same size as `sys.getsizeof(["foo".replace("o","1") for a in range(0,1000000)])` -- at least in Python 3.2 — the wolf, Aug 05 '12 at 18:54
@JBernardo, I did not store the value of the `replace` operation first because I was intentionally trying to generate *new* strings, rather than lots of references to an old string. — vy32, Aug 16 '12 at 00:47
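On the measurement question raised in the comments: `sys.getsizeof` on a list counts only the list's own array of pointers, not the strings it refers to, so both lists report the same size. A rough sketch of a measurement that also counts each distinct string once follows. The helper name `deep_size` is made up for illustration, the list is shorter than the one in the question, and the behaviour shown assumes CPython.

```python
import sys

def deep_size(strings):
    """Approximate bytes used by a list plus the distinct string objects in it.

    Hypothetical helper for illustration; not part of the standard library.
    """
    # Key by id() so each distinct object is counted once, however many
    # times the list refers to it.
    unique = {id(s): s for s in strings}
    return sys.getsizeof(strings) + sum(sys.getsizeof(s) for s in unique.values())

n = 100000  # smaller than the question's 1,000,000 to keep the demo quick

# Every element refers to the same "foo" literal.
shared = ["foo" for _ in range(n)]

# Every replace() call builds a separate "f11" object (CPython behaviour).
copies = ["foo".replace("o", "1") for _ in range(n)]

print(sys.getsizeof(shared), sys.getsizeof(copies))  # list containers only
print(deep_size(shared), deep_size(copies))          # containers plus distinct strings
```

The second pair of numbers is the one that reflects the difference described in the question.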

score 14 · Accepted Answer · answered Aug 05 '12 at 17:31

Just call intern(), which tells Python to keep a single copy of the string in its table of interned strings and return that shared copy:

a = [intern("foo".replace("o","1")) for a in range(0,1000000)]

This also results in about 18 MB, the same as the first example.

Also note the comment below if you use Python 3. Thanks, @Abe Karplus.
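For Python 3, a minimal sketch of the same idea with `sys.intern` might look like this. The list length is reduced for a quick check and the variable names are only illustrative.

```python
import sys

n = 1000  # reduced from the question's 1,000,000 for a quick check

# Without interning: each replace() call produces its own "f11" object.
plain = ["foo".replace("o", "1") for _ in range(n)]

# With interning: sys.intern() returns the single canonical copy held in
# the interpreter's intern table, so every element is the same object.
interned = [sys.intern("foo".replace("o", "1")) for _ in range(n)]

print(len({id(s) for s in plain}))     # number of distinct objects without interning
print(len({id(s) for s in interned}))  # number of distinct objects with interning
```

The two lists compare equal element by element; only the number of underlying objects differs.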

erikbstack


Note that in Python 3, `intern` has been renamed `sys.intern`. – Abe Karplus Aug 05 '12 at 17:32
+1 I didn't know about `intern()`. – Ashwini Chaudhary Aug 05 '12 at 17:54
Great, thanks. I didn't know about intern. Yes, I'm using Python 3, so I'll need to use sys.intern(). – vy32 Aug 05 '12 at 20:50

score 0 · Answer 2 · answered Aug 05 '12 at 17:21

You can try something like this:

strs=["this is string1","this is string2","this is string1","this is string2",
      "this is string3","this is string4","this is string5","this is string1",
      "this is string5"]
new_strs=[]
for x in strs:
    if x in new_strs:
        # Find the first occurrence and append a reference to it
        # instead of appending the new, equal string.
        new_strs.append(new_strs[new_strs.index(x)])
    else:
        new_strs.append(x)

print([id(y) for y in new_strs])  # parenthesized so it runs on Python 2 and 3

Strings that are equal will now have the same id().

Output:

[18632400, 18632160, 18632400, 18632160, 18651400, 18651440, 18651360, 18632400, 18651360]

Nice idea. Unfortunately it's an O(n**2) algorithm which will get really slow as the list gets longer. — Mark Ransom, Aug 05 '12 at 17:23

score -1 · Answer 3 · answered Aug 05 '12 at 17:29

Keeping a dictionary of seen strings should work:

new_strs = []
str_record = {}
for x in strs:
    if x not in str_record:
        str_record[x] = x
    new_strs.append(str_record[x])

(Untested.)
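A compact variant of the same dictionary idea, as a sketch: `dict.setdefault` stores a string the first time it appears and returns the stored object afterwards, so each lookup is an average constant-time dictionary operation rather than a list scan. The function name `share_equal_strings` and the sample data are made up for illustration.

```python
def share_equal_strings(strings):
    """Return a new list in which equal strings are all the same object.

    Hypothetical helper name, used here only for illustration.
    """
    seen = {}
    # setdefault(s, s) inserts s if it is new and returns the stored object
    # either way, so later equal strings map back to the first one seen.
    return [seen.setdefault(s, s) for s in strings]

# Stand-in for lines read from a file: format() builds a new object each time.
raw = ["line {}".format(i % 3) for i in range(9)]
shared = share_equal_strings(raw)

print(len({id(s) for s in shared}))  # number of distinct objects after sharing
```

Unlike interning, the dictionary can be discarded once the list is built, which releases its hold on the strings.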

Abe Karplus

