TL;DR
from io import StringIO
from string import punctuation
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
stoplist = stopwords.words('english') + list(punctuation)
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2), stop_words=stoplist)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()  # on scikit-learn >= 1.0, use get_feature_names_out()
See Basic NLP with NLTK to understand how the text is automatically lowercased, tokenized, and stripped of stopwords.
[out]:
['bay',
'bay chinese',
'central',
'central district',
'central industrial',
'chinese',
'chinese district',
'district',
'district bay',
'floor',
'floor room',
'industrial',
'industrial district',
'industrial floor',
'room']
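To see exactly what the built-in 'word' analyzer does to a single line (lowercasing, tokenizing, dropping stopwords, generating the n-grams), you can call build_analyzer() on the vectorizer; a minimal sketch using the vectorizer defined above:

analyzer = vectorizer.build_analyzer()
# Expect lowercased unigrams followed by bigrams, stopwords removed, e.g.
# ['industrial', 'floor', 'room', 'industrial floor', 'floor room']
analyzer("Industrial Floor room")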
If the n-gram generation happens in your own preprocessing step, just override the analyzer argument with your function (note that the preprocess below emits bigrams only):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from io import StringIO
from string import punctuation
from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords
stoplist = stopwords.words('english') + list(punctuation)
def preprocess(text):
    # Lowercase, tokenize, build bigrams, and keep only the bigrams whose
    # words are neither stopwords/punctuation nor digits.
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()
[out]:
['bay chinese',
 'central district',
 'central industrial',
 'chinese district',
 'district bay',
 'floor room',
 'industrial district',
 'industrial floor']
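To sanity-check the custom analyzer on its own, you can call preprocess directly on one of the lines; a small sketch of what it should return:

# Only bigrams come out; any bigram containing a stopword, punctuation, or a digit is dropped.
preprocess("Central Industrial District Bay")
# ['central industrial', 'industrial district', 'district bay']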
You've misunderstood the meaning of the vocabulary argument in CountVectorizer.
From the docs:

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents. Indices
        in the mapping should not be repeated and should not have any gap
        between 0 and the largest index.
That means only the terms in the vocabulary are used as feature names. If you need bigrams in your feature set, then you need to have bigrams in your vocabulary. CountVectorizer does not generate the n-grams and then check whether the n-grams contain only words from your vocabulary.
In code, you can see that if you add bigrams to your vocabulary, then they will appear in get_feature_names():
from io import StringIO
from string import punctuation
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
english_corpus = ['bay chinese', 'central district', 'chinese', 'district', 'floor', 'industrial', 'room']
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2), vocabulary=english_corpus)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()
[out]:
['bay chinese',
'central district',
'chinese',
'district',
'floor',
'industrial',
'room']
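Conversely, if the vocabulary holds only unigrams, no bigram will ever appear as a feature, even with ngram_range=(1,2): the bigrams the analyzer produces are simply not in the vocabulary, so they are ignored. A small sketch of that (unigram_vocab is made up for the illustration):

unigram_vocab = ['bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room']
v = CountVectorizer(analyzer='word', ngram_range=(1,2), vocabulary=unigram_vocab)
v.fit_transform(df['Text'])
v.get_feature_names()  # only the seven unigrams, no bigrams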
So how do I get bigrams in my feature names based on a list of single words (unigrams)?
One possible solution: write your own analyzer that generates the n-grams and filters them against the words you want to keep (here the filter drops any bigram containing a stopword, punctuation, or a digit), e.g.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from io import StringIO
from string import punctuation
from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords
stoplist = stopwords.words('english') + list(punctuation)
def preprocess(text):
    # Lowercase, tokenize, build bigrams, and keep only the bigrams whose
    # words are neither stopwords/punctuation nor digits.
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()
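As in the earlier example, the feature names should be just the bigrams that survive the stoplist filter ('bay chinese', 'central district', 'central industrial', 'chinese district', 'district bay', 'floor room', 'industrial district', 'industrial floor'); no unigrams appear, because this analyzer only ever returns two-word strings.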