There's some weird mysterious behavior here.

EDIT: This has gotten really long and tangled, and I've edited it like 10 times. The TL;DR is that in the course of processing some text, I've managed to write a function that:

  • works on individual strings of a list

  • throws a variety of errors when I try to apply it to the whole list with a list comprehension

  • throws similar errors when I try to apply it to the whole list with a loop

  • after throwing those errors, stops working on the individual strings until I re-run the function definition and feed it some sample data, then it starts working again, and finally

  • turns out to work when I apply it to the whole list with map().

There's an ipython notebook saved as html which displays the whole mess here: http://paul-gowder.com/wtf.html ---I've put a link at the top to jump past some irrelevant stuff. I've also made a[nother] gist that just has the problem code and some sample data, but since this problem seems to throw around a bunch of state somehow, I can't guarantee it'll be reproducible from it: https://gist.github.com/paultopia/402891d05dd8c05995d2

End TL/DR, begin mess

I'm doing some toy text-mining on that old Enron dataset, and I have the following set of functions to clean up the emails preparatory to turning them into a document term matrix, after loading nltk stopwords and such. The following uses the email library in Python 2.7:

import email
import re
# stopwords (nltk) and stemmer are loaded earlier in the notebook

def parseEmail(document):
    # strip unnecessary headers, header text, etc.
    theMessage = email.message_from_string(document)
    tofield = theMessage['to']
    fromfield = theMessage['from']
    subjectfield = theMessage['subject']
    bodyfield = theMessage.get_payload()
    wholeMsgList = [tofield, fromfield, subjectfield, bodyfield]
    # get rid of any fields that don't exist in the email
    cleanMsgList = [x for x in wholeMsgList if x is not None]
    # now return a string with all that stuff run together
    return ' '.join(cleanMsgList)

def lettersOnly(document):
    return re.sub("[^a-zA-Z]", " ", document)

def wordBag(document):
    return lettersOnly(parseEmail(document)).lower().split()

def cleanDoc(document):
    dasbag = wordBag(document)
    # get rid of "enron" for obvious reasons, also the .com
    bagB = [word for word in dasbag if not word in ['enron','com']]
    unstemmed =[word for word in bagB if not word in stopwords.words("english")]
    return [stemmer.stem(word) for word in unstemmed]

print enronEmails[0][1]

print cleanDoc(enronEmails[0][1])

First (T-minus half an hour) running this on an email represented as a unicode string produced the expected result: print cleanDoc(enronEmails[0][1]) yielded a list of stemmed words. To be clear, the underlying data enronEmails is a list of [label, message] lists, where label is an integer 0 or 1, and message is a unicode string. (In python 2.7.)

Then at t-10, I added a couple lines of code (since deleted and lost, unfortunately...but see below), with some list comprehensions in them to just extract the messages from the enronEmails, run my cleanup function on them, and then join them back into strings for convenient conversion into document term matrix via sklearn. But the function started throwing errors. So I put my debugging hat on...

First I tried rerunning the original definition and test cell. But when I re-ran that cell, my email parsing function suddenly started throwing an error in the message_from_string method:

AttributeError: 'list' object has no attribute 'message_from_string'

So that was bizarre. This was exactly the same function, called on exactly the same data: cleanDoc(enronEmails[0][1]). The function was working, on the same data, and I haven't changed it.

So I checked to make extra-sure I hadn't mutated the data. enronEmails[0][1] was still a string. Not a list. I have no idea why the traceback was of the opinion that I was passing a list to cleanDoc(). I wasn't.

But the plot thickens

So then I went to make a gist to create a wholly reproducible example for the purpose of posting this SO question. I started with the working part. The gist: https://gist.github.com/paultopia/c8c3e066c39336e5f3c2.

To make sure it was working, first I stuck it in a normal .py file and ran it from command line. It worked.

Then I stuck it in a cell at the bottom of my ipython notebook with all the other stuff in it. That worked too.

Then I tried the parseEmail function on enronEmails[0][1]. That worked again. Then I went all the way back up to the original cell that was throwing an error not five minutes ago and re-ran it (including the import from sklearn, and including the original definition of all functions). And it freaking worked.

BUT THEN I went back in and tried again with the list comprehensions and such. And this time, I kept track more carefully of what was going on, adding the following cells:

1.

def atLeastThreeString(cleandoc):
    return ' '.join([w for w in cleandoc if len(w)>2])
print atLeastThreeString(cleanDoc(enronEmails[0][1]))

THIS works, and produces the expected output: a string with words over 2 letters. But then:

2.

justEmails = [email[1] for email in enronEmails]
bigEmailsList = [atLeastThreeString(cleanDoc(email)) for email in justEmails]

and all of a sudden it starts throwing a whole new error, same place in the traceback:

AttributeError: 'unicode' object has no attribute 'message_from_string'

which is extra funny, because I was passing it unicode strings a minute ago and it was doing just fine. And, just to thicken the plot, going back and rerunning cleanDoc(enronEmails[0][1]) then throws the same error.

This is driving me insane. How is it possible that creating a new list, and then attempting to run function A on that list, not only throws an error on the new list, but ALSO causes function A to throw an error on data that it was previously working on? I know I'm not mutating the original list...

I've posted the entire notebook in html form here, if anyone wants to see full code and traceback: http://paul-gowder.com/wtf.html The relevant parts start about 2/3 of the way down, at the cells numbered 24-5, where it works, and then the cell numbered 26, where it blows up.

help??

Another edit: I've added some more debugging efforts to the bottom of the above-linked html notebook. As you can see, I've traced the problem down to the act of looping, whether done implicitly in list comprehension form or explicitly. My function works on an individual item in the list of just e-mails, but then fails on every single item when I try to loop over that list, except when I use map() to do it. ???? Has the world gone insane?

Paul Gowder

1 Answer


I believe the problem is these statements:

justEmails = [email[1] for email in enronEmails]
bigEmailsList = [atLeastThreeString(cleanDoc(email)) for email in justEmails] 

In Python 2, the dummy variable email in a list comprehension leaks out into the enclosing namespace, so you are overwriting the name of the email module with a string, and parseEmail then tries to call message_from_string on that string instead of on the module. I don't have nltk in Python 2, so I can't test it, but I think this must be it.
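A minimal sketch of the shadowing, written to run under both Python 2 and 3. It uses an explicit for loop, because that rebinds the name in either version, whereas the list-comprehension leak is Python 2 only. The sample messages and the helper name parse_subject are made up for illustration and are not from the question's notebook.

```python
import email

def parse_subject(document):
    # The name `email` is looked up when the function is called,
    # not when it is defined.
    return email.message_from_string(document)['subject']

# Hypothetical sample data standing in for the Enron messages.
messages = ["Subject: hello\n\nfirst body", "Subject: world\n\nsecond body"]

# Works: `email` still refers to the module here.
before = parse_subject(messages[0])

errors = []
for email in messages:  # rebinds the name `email` to a string
    try:
        parse_subject(email)
    except AttributeError as exc:
        errors.append(str(exc))

# After the loop, `email` is still bound to the last string, so calls that
# worked earlier keep failing until `import email` is run again.
still_shadowed = not hasattr(email, 'message_from_string')
```

That would also account for the function "starting to work again" after re-running the cell with the imports: the import rebinds email to the module.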

saulspatz
  • oooh! that might well explain it. Gonna go test that out, but, after searching, just to add to the record for future searches, this behavior is described here: http://stackoverflow.com/questions/4575698/python-list-comprehension-overriding-value – Paul Gowder Sep 12 '15 at 21:29
  • Just for the record, the list comprehensions wouldn't be a problem in python 3, but the for loop with loop variable email still would cause the problem. – saulspatz Sep 12 '15 at 21:36
  • You nailed it. Importing the email module under a different name works seamlessly. Thank you. I wish I could vote that up/mark it as accepted twice. Also, I think you may have just converted me to python 3. – Paul Gowder Sep 12 '15 at 21:55
  • Glad to help. Python 3 is a cleaner, more efficient language, but it takes a while to get to `print` as a function. – saulspatz Sep 12 '15 at 22:06
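For future readers, a rough sketch of the fix described in the comments, assuming the module is imported under an alias. The alias email_lib, the helper parse_subject, and the sample messages are all hypothetical names chosen for illustration; renaming the loop variable instead would work just as well.

```python
import email as email_lib  # hypothetical alias; any name not reused as a loop variable will do

def parse_subject(document):
    return email_lib.message_from_string(document)['subject']

# Hypothetical sample data standing in for the Enron messages.
messages = ["Subject: hello\n\nfirst body", "Subject: world\n\nsecond body"]

# The loop variable can be called `email` without touching the module,
# because the module now lives under a different name.
subjects = []
for email in messages:
    subjects.append(parse_subject(email))

# map() takes the function and the list directly and binds no loop variable
# in the calling scope, which is consistent with map() having worked
# in the question while the loops failed.
subjects_via_map = list(map(parse_subject, messages))
```

The design point is only that the module and the loop variable must not share a name; which of the two gets renamed is a matter of taste.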