How to tokenize text like in tidytext?

Question

I am trying to reproduce in Python the exploding tokenization of tidytext, where each token ends up on its own row:

> tibble(text = c('hasta la vista baby',
+                 'I am the terminator'),
+        value = c(1,2)) %>% 
+   unnest_tokens(input = 'text',output = 'word', token = 'words')
# A tibble: 8 x 2
  value word      
  <dbl> <chr>     
1     1 hasta     
2     1 la        
3     1 vista     
4     1 baby      
5     2 i         
6     2 am        
7     2 the       
8     2 terminator

Is it possible to do the same in pandas? I am mainly concerned with speed of execution here.

import pandas as pd

pd.DataFrame({'text': ['hasta la vista baby', 'I am the terminator'],
              'value': [1,2]})
Out[3]: 
                  text  value
0  hasta la vista baby      1
1  I am the terminator      2

Thanks!

Similar to [this question](https://stackoverflow.com/questions/62216774/extracting-top-words-by-date/62217094#62217094) — Quang Hoang, Jun 05 '20 at 19:46
`df.assign(text=df['text'].str.split()).explode('text')` in pandas — anky, Jun 05 '20 at 19:47
@anky very interesting, thanks! but I guess the pandas native solution only allows a very simple tokenization (here based on white spaces)... — ℕʘʘḆḽḘ, Jun 05 '20 at 19:52
Not necessarily. You can pass the delimiter to `str.split()`; for example, to split on commas you would do `df.assign(text=df['text'].str.split(",")).explode('text')`. See the documentation [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html) — anky, Jun 05 '20 at 19:54
I believe this is similar to [this](https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe/53218939#53218939). Do you think this is a dupe? It also covers all versions of pandas — anky, Jun 05 '20 at 19:57
Not a dupe, because we focus on text here. Perhaps you can specify how to split on sentences as well? — ℕʘʘḆḽḘ, Jun 05 '20 at 19:59
@ℕʘʘḆḽḘ not necessarily. If you are tokenising, for example, tweets that each have their own row, you are going to remove punctuation anyway. Alternatively, you can first tokenise by sentence, then explode, keeping a row number as a reference. @anky many thanks, I was not aware of `.explode()` before, and it works amazingly for what I was looking for — Robert Chestnutt, Feb 05 '23 at 21:17
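
Putting the suggestions from the comments together, here is a minimal sketch of a word-level equivalent in pandas. The helper name `unnest_tokens` and its parameters are my own (hypothetical) and are not part of pandas. The regex `\w+` is only a rough stand-in for tidytext's word tokenizer: it lowercases and drops punctuation, but it will not match tidytext on every edge case (apostrophes, for instance, split a word in two here).

```python
import pandas as pd


def unnest_tokens(df, input_col, output_col, pattern=r"\w+", to_lower=True):
    """Hypothetical helper: one row per regex match found in `input_col`."""
    text = df[input_col].astype(str)
    if to_lower:
        # tidytext lowercases tokens by default, so mimic that here
        text = text.str.lower()
    # Each cell becomes a list of tokens (all non-overlapping regex matches)
    tokens = text.str.findall(pattern)
    return (
        df.drop(columns=[input_col])
          .assign(**{output_col: tokens})
          .explode(output_col)        # one row per token, other columns repeated
          .reset_index(drop=True)
    )


df = pd.DataFrame({'text': ['hasta la vista baby', 'I am the terminator'],
                   'value': [1, 2]})
words = unnest_tokens(df, 'text', 'word')
print(words)
```

On the example data this gives eight rows, with `value` repeated for each word of the original row. Passing a different `pattern` changes what counts as a token, so a sentence-level split could be approximated the same way, although a regex will only be a crude sentence splitter. I have not benchmarked this, so treat any speed expectations as an assumption to verify on your own data.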
