I am new to Python and am trying to compute a bag of words. I used vectorizer.fit_transform as follows:
vectorizer = CountVectorizer(vocabulary=set_of_words, tokenizer=nltk.word_tokenize)
bag_of_words = vectorizer.fit_transform(doc).toarray().astype(np.float64)
where doc contains the text whose bag of words is to be extracted.
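To make clear what I expect as output, here is a minimal plain-Python sketch of the counting I have in mind. It is not what scikit-learn does internally: it assumes lowercasing and a simple regex tokenizer instead of nltk.word_tokenize, and the names bag_of_words, example_vocabulary and example_doc are made up for illustration (they are not my real doc or set_of_words).

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text."""
    # Assumption: lowercase the text and take runs of 2+ word characters
    # as tokens; this stands in for nltk.word_tokenize.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    counts = Counter(tokens)
    # One float count per vocabulary entry, in vocabulary order.
    return [float(counts[word]) for word in vocabulary]


# Hypothetical example data, only for illustration.
example_vocabulary = ["appetite", "consider", "four", "nursery"]
example_doc = "Four children in the nursery; consider the appetite of four."

print(bag_of_words(example_doc, example_vocabulary))
# prints [1.0, 1.0, 2.0, 1.0]
```

So for a single text I expect one vector with one count per vocabulary word.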
I get the following warning:
/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py:2499: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)
When I print vectorizer, I get something like this:
CountVectorizer(analyzer=u'word', binary=False, charset=None,
charset_error=None, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=<function word_tokenize at 0xafbc6f4>,
vocabulary=[u'dissolution', u'comparatively', u'desirable', u'four', u'obstruction', u'nursery', u'perverted', u'appetite', u'repress', u'consider'])