I am vectorizing a text blob with tokens that have the following style:
hi__(how are you), 908__(number code), the__(POS)
As you can see, each token has some information attached to it in the form __(info). I am extracting keywords using TF-IDF as follows:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(doc)
# sort features by descending IDF to pick out keywords
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names()
The problem is that when I run the above procedure to extract keywords, I suspect the vectorizer object is stripping the parentheses from my text blob. Which parameter of TfidfVectorizer can I use to preserve the information inside the parentheses?
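For example, would overriding token_pattern be the right direction? As far as I understand, the default pattern (?u)\b\w\w+\b only keeps word characters, which would explain why the parentheses disappear. The snippet below is only a sketch, and the regex is just my guess at a pattern that keeps a token together with its __( ... ) suffix:

from sklearn.feature_extraction.text import TfidfVectorizer

# sample document built from the tokens above
docs = ["hi__(how are you), 908__(number code), the__(POS)"]

# guess: keep "word__( ... )" together as one token, otherwise match plain words
vectorizer = TfidfVectorizer(token_pattern=r'\w+__\([^)]*\)|\w+', lowercase=False)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# should contain the full tokens, e.g. 'hi__(how are you)'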
UPDATE
I also tried the following:
from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)
and
from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

tfidf = TfidfVectorizer(
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)
However, this returns a sequence of characters instead of the tokens I have already tokenized:
['e', 's', '_', 'a', 't', 'o', 'c', 'r', 'i', 'n']
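I suspect this happens because I am still passing plain strings, so the identity tokenizer hands back the whole string and the vectorizer then iterates over it character by character. Below is a minimal sketch of what I think should work instead, assuming I pass documents that are already lists of tokens (the hard-coded list is just the example from above):

from sklearn.feature_extraction.text import TfidfVectorizer

def dummy_fun(doc):
    return doc

# documents passed in as lists of tokens rather than raw strings
tokenized_docs = [["hi__(how are you)", "908__(number code)", "the__(POS)"]]

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer=dummy_fun,
    preprocessor=dummy_fun,
    token_pattern=None)
X = tfidf.fit_transform(tokenized_docs)
print(tfidf.get_feature_names_out())  # get_feature_names() on older scikit-learn versions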