Question: are you saving your intermediate results? x.apply() does not modify x in place; it returns a new Series with the function applied to each element of the original. See below for an example of how this might be affecting your code.
We'll start by confirming that word_tokenize() works on a sample snippet of text.
>>> import pandas as pd
>>> from nltk import word_tokenize
>>> word_tokenize('hello how are you') # confirming that word_tokenize works.
['hello', 'how', 'are', 'you']
Then let's create a Series to play with.
>>> s = pd.Series(['hello how are you',
...                'lorem ipsum isumming lorems',
...                'more stuff in a line'])
>>> print(s)
0 hello how are you
1 lorem ipsum isumming lorems
2 more stuff in a line
dtype: object
Calling apply() with word_tokenize at an interactive Python prompt echoes the tokenized result, but nothing in that output indicates that it is a new Series rather than a permanent change to s.
>>> s.apply(word_tokenize)
0 [hello, how, are, you]
1 [lorem, ipsum, isumming, lorems]
2 [more, stuff, in, a, line]
dtype: object
In fact, we can print s to show that it is unchanged.
>>> print(s)
0 hello how are you
1 lorem ipsum isumming lorems
2 more stuff in a line
dtype: object
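The same check can be made without reading printed output. This is only a minimal sketch: tokenize() is a hypothetical whitespace-splitting stand-in for word_tokenize (so no NLTK data is needed), and before is an illustrative name for a snapshot of the original Series.

```python
import pandas as pd

def tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: splits on whitespace only.
    return text.split()

s = pd.Series(['hello how are you',
               'lorem ipsum isumming lorems',
               'more stuff in a line'])
before = s.copy()              # snapshot of the original (illustrative name)

result = s.apply(tokenize)     # apply() hands back a new Series

same_object = result is s      # False: the result is a different object
unchanged = s.equals(before)   # True: s still holds the original strings
```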
If, instead, we bind the result of the apply() call to a name, in this case wt, the tokenized Series is kept, which we can see by printing wt.
>>> wt = s.apply(word_tokenize)
>>> print(wt)
0 [hello, how, are, you]
1 [lorem, ipsum, isumming, lorems]
2 [more, stuff, in, a, line]
dtype: object
At an interactive prompt this is easier to detect, because the result of each expression is echoed. In a script, an unassigned s.apply(word_tokenize) builds the new Series and then discards it silently, with no error or warning.
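In a script, the pattern might look like the sketch below. It assumes a hypothetical tokenize() helper that splits on whitespace in place of word_tokenize, so the snippet runs without NLTK data; swap in word_tokenize in real code.

```python
import pandas as pd

def tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize: splits on whitespace only.
    return text.split()

s = pd.Series(['hello how are you',
               'lorem ipsum isumming lorems',
               'more stuff in a line'])

# Bare call: the new Series is created and immediately thrown away.
s.apply(tokenize)
still_strings = isinstance(s.iloc[0], str)   # s was not tokenized

# Keep the result, either under a new name...
wt = s.apply(tokenize)

# ...or by rebinding the original name if the raw text is no longer needed.
s = s.apply(tokenize)
now_lists = isinstance(s.iloc[0], list)
```

Binding to a new name such as wt keeps the raw text available for later steps; rebinding s is shorter but loses the original strings.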