Referring to this post: I am wondering how to provide a vocabulary of terms that contain spaces, e.g. distributed systems or machine learning, to a CountVectorizer model. Here is an example:
import numpy as np
from itertools import chain
tags = [
"python, tools",
"linux, tools, ubuntu",
"distributed systems, linux, networking, tools",
]
vocabulary = list(map(lambda x: x.split(', '), tags))
vocabulary = list(np.unique(list(chain(*vocabulary))))
We can pass this vocabulary list to the model:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(vocabulary=vocabulary)
print(vec.fit_transform(tags).toarray())
Here, I lose the count of the term distributed systems (first column). The result is as follows:
[[0 0 0 1 1 0]
[0 1 0 0 1 1]
[0 1 1 0 1 0]]
Do I have to change token_pattern, or something else?
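For context, one workaround I am considering (not sure it is the intended fix) is to pass a custom tokenizer that splits on the ", " delimiter, so multi-word terms survive tokenization intact. A minimal sketch:

```python
from itertools import chain

from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

# Build the vocabulary as before (sorted unique terms).
vocabulary = sorted(set(chain.from_iterable(t.split(", ") for t in tags)))

# A custom tokenizer that splits on the comma delimiter keeps
# multi-word terms like "distributed systems" as single tokens,
# so the default token_pattern is never applied.
vec = CountVectorizer(vocabulary=vocabulary, tokenizer=lambda s: s.split(", "))
print(vec.fit_transform(tags).toarray())
```

With this tokenizer the first column (distributed systems) is no longer lost for the third document, while the other counts stay the same.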