
I'm learning Python (3.6) and have discovered the following:

    a = "hi"
    b = "hi"

    a == b  # True
    a is b  # True

    a = list(a)
    b = list(b)

    a = "".join(a)
    b = "".join(b)

    a == b  # True
    a is b  # False

Why is the result different after converting to a list and joining back into a string? I understand that the Python VM maintains a pool of strings, which is why a and b refer to the same object at first. But why does this not hold after joining the list back into the very same string?

Thanks!

Carcigenicate
Saalim
  • Don't use backticks for large pieces of code. Indent by four spaces instead. – Carcigenicate Oct 24 '17 at 21:21
  • Probably because string literals are easier to optimize than the results of functions. – Fred Larson Oct 24 '17 at 21:22
  • Using the same string object is an *optimization* when it is easy to do or obvious. Imagine if, before creating any new string, Python first checked whether that string already existed. That wouldn't be very optimal, especially if there are a lot of strings in existence. – SethMMorton Oct 24 '17 at 21:25
  • Python makes no promises about when it will or won't reuse objects for equal values of immutable built-in types. You can dig into the implementation if you want, but the more important lesson is learning to ignore this stuff and move on to things that aren't implementation details. – user2357112 Oct 24 '17 at 21:25
  • Thanks for the clarification. I guess the lesson is never to compare strings using the "is" operator. – Saalim Oct 24 '17 at 21:27
  • @Saalim Bingo. You should only use `is` for comparing to `None`, or if you truly want to ensure that the objects are the same. Usually you only care about the contents. – SethMMorton Oct 24 '17 at 21:31

3 Answers


The key lies here:

a = "".join(a)

b = "".join(b)

The str.join() method returns a new string, built by joining the elements of a list.

Each call to str.join() instantiates a new string: in the first call a string is created and its reference is assigned to a; then, in the second call, another new string is built and its reference is assigned to b. Because of this, the two names a and b refer to two new and distinct strings, which are two separate objects.

The is operator behaves as designed, returning False because a and b are not references to the same object.

If you're trying to see whether the two strings are equal in content, then the == operator is the better choice.
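As a rough illustration of the point above, the following sketch rebuilds the same text twice and prints both comparisons. The variable name chars is only for illustration. The `is` result is what CPython typically gives for this case, but it is an implementation detail and not something the language promises, so no fixed output is claimed for it.

```python
# Build the same two-character text twice from a list of characters.
chars = list("hi")
a = "".join(chars)
b = "".join(chars)

# Value comparison: do both strings hold the same characters?
print(a == b)  # True

# Identity comparison: are a and b the very same object?
# CPython usually reports False here because each join() call builds
# its own string, but the language does not guarantee either answer.
print(a is b)

# id() exposes the identities that `is` compares.
print(id(a), id(b))
```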

mencucci

You shouldn't really compare anything except singletons (like None, True or False) with is. Because is doesn't really compare the content, it just checks if it's the same object. So is will fail if you compare different objects with the same content.

Your first a is b returned True because string literals are interned (*). So a and b are the same object because both are literals with the same content. But that's an implementation detail, and it could yield different results in future (or older) Python versions, so don't start comparing string literals with is on the basis that it works right now.

(*) It really should return False because the way you've written the cases they shouldn't be the same object. They just happen to be the same one because CPython optimizes some cases.
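A small sketch of the one case where `is` is the usual choice, checking for the None singleton. The function name describe is hypothetical and only serves the example.

```python
def describe(value=None):
    # `is None` asks "is this the one None object?", so falsy values
    # such as 0 or "" are still treated as real arguments.
    if value is None:
        return "no value given"
    return f"value: {value!r}"


print(describe())    # no value given
print(describe(0))   # value: 0
print(describe(""))  # value: ''
```

For everything else in the question, such as comparing the text of two strings, == is the comparison to reach for.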

MSeifert

There are lots of ways to answer this, but here it helps to think about memory: the physical bits in your RAM that make up the data. In Python, the "is" operator checks whether two names refer to the very same object (in CPython, whether their memory addresses match exactly). The "==" operator checks whether the values of the objects are equal by running the comparison defined in the __eq__ magic method (magic methods are the Python code responsible for turning operators into function calls). A class that wants value comparison has to define it; otherwise == falls back to comparing identity. The interesting part is that a and b were originally identical; this question should help you with that: "When does Python allocate new memory for identical strings?"
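To make the difference between the two operators concrete, here is a minimal sketch with a hypothetical Point class (not something from the question) that defines its own __eq__.

```python
class Point:
    """Hypothetical value class used only to illustrate == versus is."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        # Called for `p == q`; compares contents, not identity.
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)


p = Point(1, 2)
q = Point(1, 2)  # a second object with the same contents
r = p            # a second name for the first object

print(p == q)  # True: __eq__ says the contents match
print(p is q)  # False: two separate objects
print(p is r)  # True: both names refer to one object
```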

Essentially, Python can optimise the "hi" strings because they are typed into your source: before running your code it builds a table of the string literals to save memory. When the string object is built from a list, Python doesn't know in advance what the contents will be. By default, in your particular version of the interpreter, this means a new string object is created; this saves the time of checking for duplicates but requires more memory. If you want to force the interpreter to check for duplicates, use the "sys.intern" function: https://docs.python.org/3/library/sys.html

sys.intern(string):

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup – if the keys in a dictionary are interned, and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer compare instead of a string compare. Normally, the names used in Python programs are automatically interned, and the dictionaries used to hold module, class or instance attributes have interned keys. Interned strings are not immortal; you must keep a reference to the return value of intern() around to benefit from it.
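A short sketch of how this could look for the strings from the question. The names a_interned and b_interned are just for illustration. Keeping references to the returned values matters, as the documentation quoted above points out.

```python
import sys

# Rebuild the text twice, as in the question.
chars = list("hi")
a = "".join(chars)
b = "".join(chars)

# sys.intern returns the single interned copy for a given text,
# so both calls hand back the same object.
a_interned = sys.intern(a)
b_interned = sys.intern(b)

print(a_interned == b_interned)  # True
print(a_interned is b_interned)  # True
```

Even with interning available, == remains the right way to compare string contents; interning is a performance tool for cases such as heavy dictionary lookups.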

Andrew Louw