
I am classifying text using the bag of words model. I read in 800 text files, each containing a sentence.

The sentences are then represented like this:

[{"OneWord":True,"AnotherWord":True,"AndSoOn":True},{"FirstWordNewSentence":True,"AnSoOn":True},...]

How many dimensions does my data have?

Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?

user3813234

1 Answer


For each document, the bag-of-words model has a set of sparse features. For example, using the first sentence from your example:

OneWord
AnotherWord
AndSoOn

Those three are the active features for the document. The representation is sparse because we never list the inactive features explicitly and the vocabulary (all unique words that you consider as features) is very large. In other words, we did not write:

OneWord
AnotherWord
AndSoOn
FirstWordNewSentence: false

We only include those words that are "true".
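As a rough illustration in plain Python (reusing the list-of-dicts format from your question), the vocabulary is the set of all unique words, while each document stores only its active features:

docs = [
    {"OneWord": True, "AnotherWord": True, "AndSoOn": True},
    {"FirstWordNewSentence": True, "AnSoOn": True},
]

# The vocabulary is every word that appears as a feature anywhere.
vocabulary = sorted({word for doc in docs for word in doc})
print(vocabulary)
# ['AnSoOn', 'AndSoOn', 'AnotherWord', 'FirstWordNewSentence', 'OneWord']

# Each document lists only its active features; any vocabulary word
# missing from a document is implicitly False/0.
for doc in docs:
    print(sorted(doc), "-> active features:", len(doc))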

How many dimensions does my data have? Is it the number of entries in the largest vector? Or is it the number of unique words? Or something else?

If you stick with the sparse feature representation, you might want to track the average number of active features per document instead; in your example that is (3 + 2) / 2 = 2.5.
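Continuing the sketch above, that average is just:

avg_active = sum(len(doc) for doc in docs) / len(docs)
print(avg_active)  # (3 + 2) / 2 = 2.5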

If you use a dense representation (e.g., one-hot encoding, though that is not a good idea if the vocabulary is large), the input dimension equals your vocabulary size.
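For example, scikit-learn's DictVectorizer (assuming it fits your pipeline) turns the list of dicts from the sketch above into a matrix whose width is the vocabulary size:

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False)   # sparse=True would keep it as a sparse matrix
X = vec.fit_transform(docs)          # "docs" from the sketch above
print(X.shape)                       # (2, 5): 2 documents, 5-word vocabulary
print(vec.feature_names_)            # the learned vocabulary, one column per word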

If you use a 100-dimensional word embedding and combine all the words' embeddings into a single vector representing the document, then your input dimension is 100. In this case you convert the sparse features into dense features via the embedding.
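A rough sketch of that route, reusing docs and vocabulary from the first sketch (the embedding table here is made up for illustration; in practice it would come from a trained model such as word2vec or GloVe):

import numpy as np

EMBEDDING_DIM = 100
rng = np.random.default_rng(0)

# Hypothetical lookup table: word -> 100-dimensional vector.
embedding = {word: rng.standard_normal(EMBEDDING_DIM) for word in vocabulary}

def doc_vector(doc):
    # Average the embeddings of the document's active words.
    vectors = [embedding[word] for word in doc if word in embedding]
    return np.mean(vectors, axis=0)

print(doc_vector(docs[0]).shape)  # (100,) -> the input dimension is 100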

greeness
  • Thanks a lot for your answer! Just to clarify: if I use a word embedding, I specify the number of dimensions? So I could also use 50 or any other number depending on my data? – user3813234 Nov 08 '16 at 12:52
  • If you use a pre-learned embedding from elsewhere, you would need to use the original embedding dimension. Otherwise, if you learn it yourself, it is your call to specify the dimension. Of course, you could use 50 or any other number depending on your own needs. – greeness Nov 08 '16 at 16:55