
Referring to this post, I am wondering how to provide a vocabulary containing multi-word terms, e.g. distributed systems or machine learning, to a CountVectorizer model. Here is an example:

import numpy as np
from itertools import chain

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

# split each comma-separated tag string into a list of individual tags
vocabulary = list(map(lambda x: x.split(', '), tags))
# flatten the lists and keep only the unique tags
vocabulary = list(np.unique(list(chain(*vocabulary))))
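
For reference, the resulting deduplicated vocabulary (note that np.unique also sorts it) is:

print(vocabulary)
# ['distributed systems', 'linux', 'networking', 'python', 'tools', 'ubuntu']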

We can provide this vocabulary list to the model:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(vocabulary=vocabulary)
print(vec.fit_transform(tags).toarray())

Here, I lose the count for distributed systems (the first column). The result is as follows:

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [0 1 1 0 1 0]]
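
I suspect the reason is that CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", extracts runs of word characters, so distributed systems is split into two separate tokens and the multi-word vocabulary entry is never produced. A quick check with build_tokenizer (a sketch of the default behavior, as I understand it):

from sklearn.feature_extraction.text import CountVectorizer

# build_tokenizer() returns the tokenizer built from token_pattern;
# with the defaults, the multi-word tag "distributed systems" is
# broken into two tokens that never match the vocabulary entry.
tokenize = CountVectorizer().build_tokenizer()
print(tokenize("distributed systems, linux, networking, tools"))
# ['distributed', 'systems', 'linux', 'networking', 'tools']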

Do I have to change token_pattern or something else?


1 Answer


I think you have essentially already pre-defined the vocabulary you want to analyze, and you want to tokenize your tags by splitting on ', '.

You can trick CountVectorizer into doing that with a custom tokenizer:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(vocabulary=vocabulary, tokenizer=lambda x: x.split(', '))
print(vec.fit_transform(tags).toarray())

which gives:

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]
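
As for the question about token_pattern: a custom pattern can work too, since it is applied with re.findall to the (lowercased) text, but you have to craft a regex that keeps multi-word tags together. A rough sketch, assuming every tag is at least two characters long and contains only lowercase letters and internal spaces:

from sklearn.feature_extraction.text import CountVectorizer

# Runs of letters and internal spaces are kept together, so
# "distributed systems" survives as one token; commas are excluded
# from the character class and therefore act as separators.
vec = CountVectorizer(vocabulary=vocabulary,
                      token_pattern=r'(?u)[a-z][a-z ]*[a-z]')
print(vec.fit_transform(tags).toarray())

This should produce the same matrix as the tokenizer version, but splitting on ', ' is simpler and less error-prone.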
Comment: Thanks so much @Zichen, this is what I was looking for. Using `tokenizer` makes this very handy. – titipata Jun 17 '16 at 23:13