Here is a small example using ngrams from nltk. Hope it helps:
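import pandas as pd
from nltk import word_tokenize
from nltk.util import ngrams

# Note: word_tokenize needs NLTK's Punkt tokenizer data, downloaded once via nltk.download
# Creating test dataframe
df = pd.DataFrame({'text': ['my first sentence',
                            'this is the second sentence',
                            'third sent of the dataframe']})
print(df)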
Input dataframe:
text
0 my first sentence
1 this is the second sentence
2 third sent of the dataframe
Now we can use ngrams along with word_tokenize to build bigrams and trigrams, applying them to each row of the dataframe. For bigrams we pass a value of 2 to the ngrams function along with the tokenized words, whereas a value of 3 is passed for trigrams. The result returned by ngrams is a generator, so it is converted to a list. For each row, the lists of bigrams and trigrams are saved in separate columns.
df['bigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
print(df)
Result:
text \
0 my first sentence
1 this is the second sentence
2 third sent of the dataframe
bigram \
0 [(my, first), (first, sentence)]
1 [(this, is), (is, the), (the, second), (second, sentence)]
2 [(third, sent), (sent, of), (of, the), (the, dataframe)]
trigram
0 [(my, first, sentence)]
1 [(this, is, the), (is, the, second), (the, second, sentence)]
2 [(third, sent, of), (sent, of, the), (of, the, dataframe)]
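
To make it clearer what ngrams computes and why the list() call is needed, here is a rough pure-Python sketch of the same sliding-window idea. It is not nltk's implementation: sliding_ngrams is a hypothetical helper, it uses a plain whitespace split instead of word_tokenize, and it ignores extras such as the padding options that nltk's ngrams offers.

```python
def sliding_ngrams(tokens, n):
    """Return a lazy iterator of n-token tuples (hypothetical helper, not nltk's code)."""
    # Zip n staggered views of the token list. zip stops at the shortest view,
    # so an input with fewer than n tokens produces no n-grams at all.
    return zip(*(tokens[i:] for i in range(n)))


# Assumption: plain whitespace split stands in for word_tokenize here.
tokens = "this is the second sentence".split()

bigrams = sliding_ngrams(tokens, 2)
bigram_list = list(bigrams)            # materialise the lazy iterator
leftover = list(bigrams)               # a second pass finds it already exhausted

trigram_list = list(sliding_ngrams(tokens, 3))
too_short = list(sliding_ngrams(["hello"], 2))

print(bigram_list)
print(trigram_list)
print(leftover, too_short)
```

The second list(bigrams) call comes back empty because the iterator has already been consumed. Storing an unconverted generator in a dataframe column would run into the same issue, which is why the answer wraps each call in list() before assigning.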