
I hope I don't have to provide an example set.

I have a 2D array where each inner array contains the set of words from a sentence.

I am using a CountVectorizer, effectively calling fit_transform on the whole 2D array, so that I can build a vocabulary of words.

However, I have sentences like:

u'Besides EU nations , Switzerland also made a high contribution at Rs 171 million LOCATION_SLOT~-nn+nations~-prep_besides nations~-prep_besides+made~prep_at made~prep_at+rs~num rs~num+NUMBER_SLOT'

My current vectorizer is too strict: it strips characters like ~ and + when building tokens. Instead, I want each whitespace-separated word (i.e. what split() would give) to be a token in the vocab, so rs~num+NUMBER_SLOT should be a word in itself in the vocab, as should made. At the same time, stopwords like the and a (the normal stopwords set) should be removed.

Current vectorizer:

vectorizer = CountVectorizer(analyzer="word", stop_words=None, tokenizer=None, preprocessor=None, max_features=5000)

You can specify a token_pattern, but I am not sure which one would achieve this. Trying:

token_pattern="[^\s]*"

Leads to a vocabulary of:

{u'': 0, u'p~prep_to': 3764, u'de~dobj': 1107, u'wednesday': 4880, ...}

Which messes things up as u'' is not something I want in my vocabulary.

What is the right token_pattern for the kind of vocabulary_ I want to build?

  • Maybe wordpunct_tokenize? http://www.nltk.org/_modules/nltk/tokenize/regexp.html – RAVI Sep 05 '16 at 00:13
  • That is awesome, do any of these have a remove stopwords option? I basically have to split the string on whitespace (super easy), and then remove any stopwords identified in the tokens. The sequence does matter though I guess. – Dhruv Ghulati Sep 05 '16 at 00:16
  • http://stackoverflow.com/a/19133088/1168680 – RAVI Sep 05 '16 at 00:19
  • I need to replicate the CountVectorizer, because I need to obtain float representations of each sentence. I need the `vocabulary_` methodology of the scikit-learn vectorizer to fit a vocab to all my sentences, and then return for each sentence something like `[1 0 0 0 0 0 1 0 0 2 0 0 ]`. Perhaps I can call the `WhitespaceTokenizer` somehow in the `tokenizer` argument? The `CountVectorizer` has a remove-stopwords option, hence I think it's the best choice; I'm just getting the whitespace splitting wrong (it is allowing in `u' '`). (A sketch of this whitespace-tokenizer idea follows these comments.) – Dhruv Ghulati Sep 05 '16 at 00:30
  • How/why? The CountVectorizer already deals with stopwords, which is great; it just needs the right string pattern to split on. The problem here is the regex pattern. For some reason `token_pattern="[^\s]*"` as an argument is letting in `u''`. Edited the question so it's clearer. – Dhruv Ghulati Sep 05 '16 at 00:40
  • I think there is some other issue: for `u'': 0` you are getting a count of 0. – RAVI Sep 05 '16 at 02:02
  • That isn't the count, it's the dictionary index created for the vocabulary. – Dhruv Ghulati Sep 05 '16 at 08:45
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/122644/discussion-between-ravi-and-dhruv-ghulati). – RAVI Sep 05 '16 at 08:59
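
A rough sketch of the idea floated in the comments: pass a whitespace splitter as the tokenizer and let CountVectorizer drop the stop words itself. This is only an illustration under those assumptions (it uses str.split rather than NLTK's WhitespaceTokenizer, and it is not what the question ultimately settled on):

from sklearn.feature_extraction.text import CountVectorizer

# When a tokenizer callable is supplied, token_pattern is ignored and
# stop words are filtered from the resulting tokens.
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=lambda text: text.split(),
                             stop_words="english",
                             max_features=5000)
X = vectorizer.fit_transform(sentences)   # `sentences` is an iterable of strings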

1 Answer


I have figured this out. The token pattern was matching 0 or more non-whitespace characters; it should match 1 or more. The correct CountVectorizer is:

CountVectorizer(analyzer="word", token_pattern="[\S]+", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
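
A minimal usage sketch of the fixed vectorizer (the sentence is the one from the question; exact indices will vary, and the default lowercase=True means vocabulary keys come out lowercased):

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    u'Besides EU nations , Switzerland also made a high contribution at Rs 171 million '
    u'LOCATION_SLOT~-nn+nations~-prep_besides nations~-prep_besides+made~prep_at '
    u'made~prep_at+rs~num rs~num+NUMBER_SLOT',
]

vectorizer = CountVectorizer(analyzer="word", token_pattern=r"[\S]+",
                             tokenizer=None, preprocessor=None,
                             stop_words=None, max_features=5000)
X = vectorizer.fit_transform(sentences)

# Every whitespace-separated chunk is now a vocabulary entry, and u'' no longer appears.
print(vectorizer.vocabulary_)   # e.g. {u'made': ..., u'rs~num+number_slot': ..., ...}
print(X.toarray())              # one row of token counts per sentence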