There's some weird mysterious behavior here.
EDIT This has gotten really long and tangled, and I've edited it like 10 times. The TL/DR is that in the course of processing some text, I've managed to write a function that:
works on individual strings of a list
throws a variety of errors when I try to apply it to the whole list with a list comprehension
throws similar errors when I try to apply it to the whole list with a loop
after throwing those errors, stops working on the individual strings until I re-run the function definition and feed it some sample data, then it starts working again, and finally
turns out to work when I apply it to the whole list with map().
There's an ipython notebook saved as html which displays the whole mess here: http://paul-gowder.com/wtf.html ---I've put a link at the top to jump past some irrelevant stuff. I've also made a[nother] gist that just has the problem code and some sample data, but since this problem seems to throw around a bunch of state somehow, I can't guarantee it'll be reproducible from it: https://gist.github.com/paultopia/402891d05dd8c05995d2
End TL/DR, begin mess
I'm doing some toy text-mining on that old enron dataset, and I have the following set of functions to clean up the emails preparatory to turning them into a document term matrix, after loading nltk stopwords and such. The following uses the email library in python 2.7
def parseEmail(document):
# strip unnecessary headers, header text, etc.
theMessage = email.message_from_string(document)
tofield = theMessage['to']
fromfield = theMessage['from']
subjectfield = theMessage['subject']
bodyfield = theMessage.get_payload()
wholeMsgList = [tofield, fromfield, subjectfield, bodyfield]
# get rid of any fields that don't exist in the email
cleanMsgList = [x for x in wholeMsgList if x is not None]
# now return a string with all that stuff run together
return ' '.join(cleanMsgList)
def lettersOnly(document):
return re.sub("[^a-zA-Z]", " ", document)
def wordBag(document):
return lettersOnly(parseEmail(document)).lower().split()
def cleanDoc(document):
dasbag = wordBag(document)
# get rid of "enron" for obvious reasons, also the .com
bagB = [word for word in dasbag if not word in ['enron','com']]
unstemmed =[word for word in bagB if not word in stopwords.words("english")]
return [stemmer.stem(word) for word in unstemmed]
print enronEmails[0][1]
print cleanDoc(enronEmails[0][1])
First (T-minus half an hour) running this on an email represented as a unicode string produced the expected result: print cleanDoc(enronEmails[0][1])
yielded a list of stemmed words. To be clear, the underlying data enronEmails is a list of [label, message] lists, where label is an integer 0 or 1, and message is a unicode string. (In python 2.7.)
Then at t-10, I added a couple lines of code (since deleted and lost, unfortunately...but see below), with some list comprehensions in them to just extract the messages from the enronEmails, run my cleanup function on them, and then join them back into strings for convenient conversion into document term matrix via sklearn. But the function started throwing errors. So I put my debugging hat on...
First I tried rerunning the original definition and test cell. But when I re-ran that cell, my email parsing function suddenly started throwing an error in the message_from_string method:
AttributeError: 'list' object has no attribute 'message_from_string'
So that was bizarre. This was exactly the same function, called on exactly the same data: cleanDoc(enronEmails[0][1])
. The function was working, on the same data, and I haven't changed it.
So checked to make extra-sure I didn't mutate the data. enronEmails[0][1]
was still a string. Not a list. I have no idea why traceback was of the opinion that I was passing a list to cleanDoc(). I wasn't.
But the plot thickens
So then I went to a make a gist to create a wholly reproducible example for the purpose of posting this SO question. I started with the working part. The gist: https://gist.github.com/paultopia/c8c3e066c39336e5f3c2.
To make sure it was working, first I stuck it in a normal .py file and ran it from command line. It worked.
Then I stuck it in a cell at the bottom of my ipython notebook with all the other stuff in it. That worked too.
Then I tried the parseEmail function on enronEmails[0][1]
. That worked again. Then I went all the way back up to the original cell that was throwing an error not five minutes ago and re-ran it (including the import from sklearn, and including the original definition of all functions). And it freaking worked.
BUT THEN I then went back in and tried again with the list comprehensions and such. And this time, I kept track more carefully of what was going on. Adding the following cells:
1.
def atLeastThreeString(cleandoc):
return ' '.join([w for w in cleandoc if len(w)>2])
print atLeastThreeString(cleanDoc(enronEmails[0][1]))
THIS works, and produces the expected output: a string with words over 2 letters. But then: 2.
justEmails = [email[1] for email in enronEmails]
bigEmailsList = [atLeastThreeString(cleanDoc(email)) for email in justEmails]
and all of a sudden it starts throwing a whole new error, same place in the traceback:
AttributeError: 'unicode' object has no attribute 'message_from_string'
which is extra funny, because I was passing it unicode strings a minute ago and it was doing just fine. And, just to thicken the plot, then going back and rerunning cleanDoc(enronEmails[0][1])
throws the same error
This is driving me insane. How is it possible that creating a new list, and then attempting to run function A on that list, not only throws an error on the new list, but ALSO causes function A to throw an error on data that it was previously working on? I know I'm not mutating the original list...
I've posted the entire notebook in html form here, if anyone wants to see full code and traceback: http://paul-gowder.com/wtf.html The relevant parts start about 2/3 of the way down, at the cells numbered 24-5, where it works, and then the cell numbered 26, where it blows up.
help??
Another edit: I've added some more debugging efforts to the bottom of the above-linked html notebook. As you can see, I've traced the problem down to the act of looping, whether done implicitly in list comprehension form or explicitly. My function works on an individual item in the list of just e-mails, but then fails on every single item when I try to loop over that list, except when I use map() to do it. ???? Has the world gone insane?