Just a fundamental question regarding Python and the .join() method:

import difflib

file1 = open(f1,"r")
file2 = open(f2,"r")
file3 = open("results","w")

diff = difflib.Differ()
result = diff.compare(file1.read(),file2.read())
file3.write("".join(result))

The above snippet of code yields a nice output stored in a file called "results", in string format, showing the differences between the two files line-by-line. However, I notice that if I just print "result" without using .join(), the interpreter prints a message that includes a memory address. After trying to write the result to the file without using .join(), I was informed by the interpreter that only strings and character buffers may be written to the file, and not generator objects. So based on all the evidence that I have adduced, please correct me if I am wrong:

  1. result = diff.compare(file1.read(),file2.read()) <---- result is a generator object?

  2. result is a list of strings, with result itself being the reference to the first string?

  3. .join() takes a memory address and points to the first, and then iterates over the rest of the addresses of strings in that structure?

  4. A generator object is an object that returns a pointer?

I apologize if my questions are unclear, but I basically wanted to ask the python veterans if my deductions were correct. My question is less about the observable results, and more so about the inner workings of python. I appreciate all of your help.

eazar001
  • You don't have a memory address; Python gives you a representation of an object, and the default for custom objects is to show the type and memory address of the object. There is still an object there. – Martijn Pieters Jan 21 '13 at 21:00

1 Answer

join is a method of strings. It takes any iterable, iterates over it, and joins the contents together, using the string it was called on as the separator. (The contents must all be strings, or it will raise a TypeError.)
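To make that concrete, here is a minimal sketch, written for Python 3 (the transcript further down uses Python 2 print statements). The helper name `letters` is made up for illustration; the point is only that str.join does not care what kind of iterable it receives, as long as every item is a string.

```python
def letters():
    # A tiny generator standing in for any lazily produced sequence of strings.
    yield 'A'
    yield 'B'
    yield 'C'

from_list = ''.join(['A', 'B', 'C'])       # 'ABC'
from_tuple = '-'.join(('A', 'B', 'C'))     # 'A-B-C'
from_generator = ''.join(letters())        # 'ABC'

# Items that are not strings are rejected with a TypeError.
try:
    ''.join([1, 2, 3])
    join_error = None
except TypeError as exc:
    join_error = exc

print(from_list, from_tuple, from_generator)
print(type(join_error).__name__)
```

The string the method is called on is the separator, which is why an empty string is used when the pieces should simply be concatenated.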

If you attempt to write the generator object directly to the file, write() is handed the generator object itself, not its contents, and rejects it because it is not a string. join "unrolls" the contents of the generator into a single string that can be written.

You can see what is going on with a simple, explicit generator:

def gen():
    yield 'A'
    yield 'B'
    yield 'C'

>>> g = gen()
>>> print g
<generator object gen at 0x0000000004BB9090>
>>> print ''.join(g)
ABC

The generator doles out its contents one at a time. If you try to look at the generator itself, it doesn't dole anything out and you just see it as "generator object". To get at its contents, you need to iterate over them. You can do this with a for loop, with the next function, or with any of various other functions/methods that iterate over things (str.join among them).
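As a rough illustration of those options (Python 3 syntax, which is an assumption on my part since the transcript above is Python 2), the sketch below pulls items out with next, with a loop, and with a hand-written stand-in for join. The name `manual_join` is hypothetical; it only mimics what an iterating consumer does and is not how str.join is actually implemented.

```python
def gen():
    yield 'A'
    yield 'B'
    yield 'C'

g = gen()
first = next(g)                  # 'A': one item is handed out
rest = [item for item in g]      # ['B', 'C']: the loop takes what is left

# The generator is now exhausted; asking again raises StopIteration.
try:
    next(g)
    exhausted = False
except StopIteration:
    exhausted = True

leftover = ''.join(g)            # '': nothing remains to join

def manual_join(iterable):
    # Hypothetical helper: drive the iterator by hand with iter() and next().
    iterator = iter(iterable)
    joined = ''
    while True:
        try:
            joined += next(iterator)
        except StopIteration:
            break
    return joined

print(first, rest, exhausted, repr(leftover), manual_join(gen()))
```

Note that a generator can only be consumed once, so a second pass needs a fresh call to gen().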

When you say that result "is a list of strings" you are getting close to the idea. A generator (or iterable) is sort of like a "potential list". Instead of actually being a list of all its contents all at once, it lets you peel off each item one at a time.
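Applied to the difflib case from the question, a small sketch (Python 3) might look like the following. The two line lists are made-up stand-ins for what readlines() would return; Differ.compare is documented as taking sequences of lines, so the sketch passes lists of lines rather than whole-file strings.

```python
import difflib
import types

# Hypothetical stand-ins for the lines of the two files.
old_lines = ['one\n', 'two\n']
new_lines = ['one\n', 'three\n']

result = difflib.Differ().compare(old_lines, new_lines)
is_generator = isinstance(result, types.GeneratorType)

lines = list(result)           # turn the "potential list" into a real one
text = ''.join(lines)          # the single string that can be written to a file
leftover = ''.join(result)     # the generator was already consumed by list()

print(is_generator)
print(text)
```

Calling list() is only needed if you want to look at the items more than once; passing the generator straight to join works too, as in the question.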

None of the objects is a "memory address". The string representation of a generator object (like that of many other objects) includes a memory address, so if you print it (as above) or write it to a file, you'll see that address. But that doesn't mean that object "is" that memory address, and the address itself isn't really usable as such. It's just a handy identifying tag so that if you have multiple objects you can tell them apart.
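A short sketch of that distinction, again in Python 3 and with illustrative variable names: two generators made by the same function are separate objects with separate identities, yet they hand out the same contents when iterated.

```python
def gen():
    yield 'A'
    yield 'B'
    yield 'C'

g1 = gen()
g2 = gen()

same_object = g1 is g2           # False: two distinct generator objects
description = repr(g1)           # a label such as '<generator object gen at 0x...>'

contents_1 = ''.join(g1)         # 'ABC'
contents_2 = ''.join(g2)         # 'ABC'

print(same_object, description)
print(contents_1, contents_2)
```

The representation is only a label for the object; the contents come from iterating over it.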

BrenBarn
  • Note that `join` assumes that the iterable contains/yields only strings. It will complain if that isn't the case... – mgilson Jan 21 '13 at 21:00
  • Interesting fact: Giving `''.join()` a generator is *slower* than giving `''.join()` the result of calling `list()` on a generator. `''.join(list(result))` is faster than `''.join(result)`. – Martijn Pieters Jan 21 '13 at 21:02
  • @Marcin: The message was: – eazar001 Jan 21 '13 at 21:18
  • @BrenBarn, great, so .join() is basically setting up a pointer that iterates over a generator, accesses the contents, and adjoins them. It kind of reminds me of a simpler form of accessing a character array in C. Thanks again! – eazar001 Jan 21 '13 at 21:29
  • @eazar001: You don't really want to think of it in terms of pointers. Basically a generator provides a certain API (the iterator protocol), and `join` makes use of that API to get the generator to release its contents. – BrenBarn Jan 21 '13 at 21:49
  • @BrenBarn, Okay, I'll make sure to get that out of my head, I guess I should have started emptying pointers from my head when the first few posts stated 'no memory addresses.' It's just really difficult for me to link "contents" and "iteration" without thinking of pointers I suppose. Thank you for clearing that up though. – eazar001 Jan 21 '13 at 22:14
  • @MartijnPieters a quick benchmark in iPython shows this to be false. Why would you think otherwise? – Phob Mar 13 '17 at 18:50
  • @Phob see http://stackoverflow.com/a/9061024/100297. Check your benchmark, you are probably exhausting your generator in the first run. – Martijn Pieters Mar 13 '17 at 18:54