
I am trying to generate bigrams using CountVectorizer and attach them back to the dataframe. However, it gives me only unigrams as output. I want to create the bigrams only if specific keywords are present, which I am passing via the vocabulary parameter.

What I am trying to achieve is to eliminate the other words in the text corpus and build n-grams only from the words present in the vocabulary list.

Input data

 Id Name
    1   Industrial  Floor chenidsd 34
    2   Industrial  Floor room   345
    3   Central District    46
    4   Central Industrial District  Bay
    5   Chinese District Bay
    6   Bay Chinese xrty
    7   Industrial  Floor chenidsd 34
    8   Industrial  Floor room   345
    9   Central District    46
    10  Central Industrial District  Bay
    11  Chinese District Bay
    12  Bay Chinese dffefef
    13  Industrial  Floor chenidsd 34
    14  Industrial  Floor room   345
    15  Central District    46
    16  Central Industrial District  Bay
    17  Chinese District Bay
    18  Bay Chinese grty

NLTK

import string

import nltk

new_stop_words = nltk.corpus.stopwords.words('english')

# lowercase every token
Nata['Clean_Name'] = Nata['Name'].apply(lambda x: ' '.join(item.lower() for item in x.split()))
# drop digits (character by character)
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join(ch for ch in x if not ch.isdigit()))
# drop punctuation (character by character)
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join(ch for ch in x if ch not in string.punctuation))
# remove stopwords
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ' '.join(item for item in x.split() if item not in new_stop_words))
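The cleaning steps above walk over the same column several times (the point alvas makes in the comments below). A minimal single-pass sketch, where `clean_name` and the `stoplist` argument are names introduced here for illustration:

```python
import string

def clean_name(text, stoplist):
    """Lowercase, strip digits and punctuation from each token, drop stopwords."""
    kept = []
    for token in text.lower().split():
        token = ''.join(ch for ch in token
                        if not ch.isdigit() and ch not in string.punctuation)
        if token and token not in stoplist:
            kept.append(token)
    return ' '.join(kept)

# With NLTK: stoplist = set(nltk.corpus.stopwords.words('english'))
# Nata['Clean_Name'] = Nata['Name'].apply(lambda x: clean_name(x, stoplist))
```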

Vocabulary Definition

 english_corpus=['bay','central','chinese','district', 'floor','industrial','room']  

Bigram Generator

 cv = CountVectorizer(max_features=200, analyzer='word',
                      vocabulary=english_corpus, ngram_range=(2, 2))
 cv_addr = cv.fit_transform(Nata.pop('Clean_Name'))
 for i, col in enumerate(cv.get_feature_names()):
     Nata[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)

However, it gives me only unigrams as output. How can I fix this?

Output

In[26]:Nata.columns.tolist()
Out[26]:

['Id',
 'Name',
 'bay',
 'central',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']
  • Please see https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe , you're making the same mistake of re-processing the same columns multiple times. – alvas Dec 13 '17 at 08:54
  • @alvas, how can I make bigrams and unigrams only if the words in the dictionary occur together? – aeapen Dec 13 '17 at 09:07
  • @alvas, what I am trying to achieve is eliminating other words in the text corpus and making n-grams from the list present in the dictionary. – aeapen Dec 13 '17 at 09:12
  • See updated answer. – alvas Dec 13 '17 at 09:54
  • @alvas thanks for the explanation; how can I modify it to create uni-, bi- and trigrams as well from the list of dictionary words? – aeapen Dec 13 '17 at 12:56

1 Answer


TL;DR

from io import StringIO
from string import punctuation

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

stoplist = stopwords.words('english') + list(punctuation)

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2), stop_words=stoplist)
vectorizer.fit_transform(df['Text'])

vectorizer.get_feature_names()

See Basic NLP with NLTK to understand how the input is automatically lowercased, "tokenized", and stripped of stopwords.

[out]:

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']

If the n-gram generation is part of your preprocessing step, just override the analyzer argument:

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


from io import StringIO
from string import punctuation

from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords

stoplist = stopwords.words('english') + list(punctuation)

def preprocess(text):
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()),2) 
            if not any([word for word in ng if word in stoplist or word.isdigit()])
           ]

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])


vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])    
vectorizer.get_feature_names()

[out]:

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']

You've misunderstood the meaning of the vocabulary argument in the CountVectorizer.

From the docs:

vocabulary :

Mapping or iterable, optional Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.

That means that only the terms in the vocabulary are considered as your feature names. If you need bigrams in your feature set, then you need to have bigrams in your vocabulary.

It doesn't generate the ngrams and then check whether the ngrams only contains words from your vocabulary.

In code, you see that if you add bigrams in your vocabulary, then they will appear in the feature_names():

from io import StringIO
from string import punctuation

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

english_corpus=['bay chinese','central district','chinese','district', 'floor','industrial','room']

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2),vocabulary=english_corpus)
vectorizer.fit_transform(df['Text'])

vectorizer.get_feature_names()

[out]:

['bay chinese',
 'central district',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']
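To attach these counts back to the dataframe (the original goal), one option is to build a DataFrame from the sparse matrix in a single step instead of a per-column loop over pd.SparseSeries. A sketch under the same vocabulary as above; `features` and `result` are names introduced here for illustration (with a list vocabulary, the column order follows the list):

```python
from io import StringIO

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

text = """Industrial  Floor
Central Industrial District  Bay
Bay Chinese"""

english_corpus = ['bay chinese', 'central district', 'chinese', 'district',
                  'floor', 'industrial', 'room']

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2),
                             vocabulary=english_corpus)
counts = vectorizer.fit_transform(df['Text'])

# Densify once and join the counts onto the original frame
features = pd.DataFrame(counts.toarray(), columns=english_corpus, index=df.index)
result = pd.concat([df, features], axis=1)
```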

So how do I get bigrams in my feature names based on a list of single words (unigrams)?

One possible solution: you have to write your own analyzer that generates the n-grams and checks that the n-grams generated are in your list of words you want to keep, e.g.

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


from io import StringIO
from string import punctuation

from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords

stoplist = stopwords.words('english') + list(punctuation)

def preprocess(text):
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()),2) 
            if not any([word for word in ng if word in stoplist or word.isdigit()])
           ]

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])


vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])    
vectorizer.get_feature_names()
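To also produce unigrams and trigrams restricted to a keep-list (the follow-up question in the comments), the same idea extends with nltk's everygrams. A sketch using a plain str.split to stay self-contained (swap in word_tokenize for real text); `preprocess_123` and `keep` are names introduced here for illustration:

```python
from nltk.util import everygrams

keep = {'bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room'}

def preprocess_123(text):
    """Emit every 1-, 2- and 3-gram whose words are all in the keep-list."""
    tokens = text.lower().split()
    return [' '.join(ng) for ng in everygrams(tokens, 1, 3)
            if all(word in keep for word in ng)]

# vectorizer = CountVectorizer(analyzer=preprocess_123)
```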