
I am trying to tokenize each sentence of my pandas Series. I tried to do it as shown in the documentation, using apply, but it didn't work:

x.apply(nltk.word_tokenize)

Using nltk.word_tokenize(x) directly doesn't work either, because x is not a string. Does anyone have any idea?

Edit: x is a pandas Series of sentences:

0       A very, very, very slow-moving, aimless movie ...
1       Not sure who was more lost - the flat characte...
2       Attempting artiness with black & white and cle...

With x.apply(nltk.word_tokenize) it returns exactly the same thing:

0       A very, very, very slow-moving, aimless movie ...
1       Not sure who was more lost - the flat characte...
2       Attempting artiness with black & white and cle...

With nltk.word_tokenize(x) the error is:

TypeError: expected string or bytes-like object
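
For reference, here is a minimal reproduction (assuming nltk and pandas are installed and the punkt tokenizer data has been downloaded); word_tokenize expects a single string, so passing the whole Series produces this error:

import nltk
import pandas as pd

x = pd.Series(['A very, very, very slow-moving, aimless movie'])
nltk.word_tokenize(x)  # raises TypeError: expected string or bytes-like object
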
CAB
  • When you say it didn't work, are you getting an error? If so, could you paste the error here? Also, a very small example of what `x` is would be helpful – johnchase Aug 26 '18 at 02:14
  • If you were to run `print(x.apply(nltk.word_tokenize))`, does this return the same result? – johnchase Aug 26 '18 at 02:56
  • If nltk.word_tokenize(x) gives you TypeError: expected string or bytes-like object - It could be the case that you have null values. – hinson Aug 26 '18 at 03:32
  • See also https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe – alvas Aug 26 '18 at 04:41
  • what's your output of `type(x)`? – alvas Aug 26 '18 at 04:42
  • @Carolina Bury a minimum viable sample of your code that produces the error would go a long way toward us being able to answer your question. = ) – E. Ducateme Aug 26 '18 at 17:32

2 Answers


Question: are you saving your intermediate results? x.apply() creates a copy of your original Series with the transformation applied to each element; it does not change the original. See below for an example of how this might be affecting your code.

We'll start by confirming that word_tokenize() works on a sample snippet of text.

>>> import pandas as pd
>>> from nltk import word_tokenize
>>> word_tokenize('hello how are you')   # confirming that word_tokenize works.
['hello', 'how', 'are', 'you']            

Then let's create a Series to play with.

>>> s = pd.Series(['hello how are you',
                   'lorem ipsum isumming lorems',
                   'more stuff in a line'])

>>> print(s)
0              hello how are you
1    lorem ipsum isumming lorems
2           more stuff in a line
dtype: object

Executing word_tokenize using the apply() function at an interactive Python prompt shows that it tokenizes each element...

But it doesn't indicate that the result is a copy, not a permanent change to s:

>>> s.apply(word_tokenize)
0              [hello, how, are, you]
1    [lorem, ipsum, isumming, lorems]
2          [more, stuff, in, a, line]
dtype: object

In fact, we can print s to show that it is unchanged...

>>> print(s)
0              hello how are you
1    lorem ipsum isumming lorems
2           more stuff in a line
dtype: object

If, instead, we assign the result of the apply() call to a name, in this case wt, the results are saved, which we can see by printing wt.

>>> wt = s.apply(word_tokenize)
>>> print(wt)
0              [hello, how, are, you]
1    [lorem, ipsum, isumming, lorems]
2          [more, stuff, in, a, line]
dtype: object

Doing this at an interactive prompt makes such a situation easy to spot, but when the same code runs in a script, the fact that an unsaved copy was produced can pass silently, without any indication.
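
If the goal is to change the original Series itself, reassigning the result back to the same name is one option (a small sketch continuing the example above):

>>> s = s.apply(word_tokenize)   # overwrite s with the tokenized copy
>>> print(s)
0              [hello, how, are, you]
1    [lorem, ipsum, isumming, lorems]
2          [more, stuff, in, a, line]
dtype: object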

E. Ducateme

The apply call should work fine. I tried your code and it is working fine for me. Can you share the exact code you are using?

    In [16]: s
    Out[16]:
    0     A very, very, very slow-moving, aimless movie
    1    Not sure who was more lost - the flat characte
    dtype: object

    In [17]: s.apply(nltk.word_tokenize)
    Out[17]:
    0    [A, very, ,, very, ,, very, slow-moving, ,, ai...
    1    [Not, sure, who, was, more, lost, -, the, flat...
    dtype: object
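
If the TypeError comes from missing values in the Series, as one of the comments suggests, dropping them before tokenizing is one way around it; a minimal sketch, assuming the Series (here called s2) may contain NaN/None:

    In [18]: s2 = pd.Series(['hello how are you', None])

    In [19]: s2.dropna().apply(nltk.word_tokenize)   # skip missing rows before tokenizing
    Out[19]:
    0    [hello, how, are, you]
    dtype: object
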
Jay Rajput