I have an Excel dataset containing user type, ID, and property descriptions. I have imported this file into a pandas DataFrame (df).

Now I want to split the contents of the Description column into one-word, two-word, and three-word groups. I can do one-word tokenization with the NLTK library, but I am stuck on two- and three-word tokenization. For example, one of the rows in the Description column contains the sentence:

A brand new residential apartment at mumbai main road with portable water.

I want this sentence to be split as

"A brand", "brand new", "new residential", "residential apartment", ..., "portable water".

This splitting should be applied to every row of that column.

Image of my dataset in excel format

Andrew Lohr
Rajitha Naik
  • How about you 1) don't post pictures 2) don't post links to pictures 3) much less links to pictures of _excel_ data. – cs95 Aug 24 '17 at 19:19
  • And read: http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – cs95 Aug 24 '17 at 19:23
  • There's an `ngrams` function in nltk that does this pretty easily, taking an argument for the number of words you want to group together – kev8484 Aug 24 '17 at 19:24

1 Answer

Here is a small example using ngrams from NLTK. Hope it helps:

import pandas as pd
from nltk.util import ngrams
from nltk import word_tokenize

# Creating test dataframe
df = pd.DataFrame({'text': ['my first sentence', 
                            'this is the second sentence', 
                            'third sent of the dataframe']})
print(df)

Input dataframe:

    text
0   my first sentence
1   this is the second sentence
2   third sent of the dataframe

Now we can combine ngrams with word_tokenize to build bigrams and trigrams for each row of the dataframe. For bigrams we pass 2 to the ngrams function along with the tokenized words; for trigrams we pass 3. ngrams returns a generator, so the result is converted to a list. For each row, the lists of bigrams and trigrams are saved in separate columns.

df['bigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 2)))
df['trigram'] = df['text'].apply(lambda row: list(ngrams(word_tokenize(row), 3)))
print(df)

Result:

                     text  \
0            my first sentence   
1  this is the second sentence   
2  third sent of the dataframe   

                                                   bigram  \
0                            [(my, first), (first, sentence)]   
1  [(this, is), (is, the), (the, second), (second, sentence)]   
2    [(third, sent), (sent, of), (of, the), (the, dataframe)]   

                                                     trigram  
0                                        [(my, first, sentence)]  
1  [(this, is, the), (is, the, second), (the, second, sentence)]  
2     [(third, sent, of), (sent, of, the), (of, the, dataframe)]  
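
The columns above hold tuples, while the question asks for plain strings such as "A brand". As a rough sketch of one way to get strings directly, the helper below (the name word_ngrams is hypothetical, not part of NLTK or pandas) joins each window of words with a space. It assumes plain whitespace splitting is good enough, so punctuation stays attached to words; swapping in word_tokenize would be the alternative if that matters.

```python
def word_ngrams(sentence, n):
    """Return the n-word phrases of `sentence` as plain strings."""
    # Assumption: whitespace splitting is sufficient for this data.
    words = sentence.split()
    # Zipping n staggered copies of the word list yields sliding windows.
    return [" ".join(gram) for gram in zip(*(words[i:] for i in range(n)))]


sentence = "A brand new residential apartment"
print(word_ngrams(sentence, 2))
# ['A brand', 'brand new', 'new residential', 'residential apartment']
print(word_ngrams(sentence, 3))
# ['A brand new', 'brand new residential', 'new residential apartment']
```

Something like df['text'].apply(lambda s: word_ngrams(s, 2)) should then fill a column with string bigrams instead of tuples.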
niraj