Is it possible to have unordered bigrams in a CountVectorizer?

Question

I would like to generate unordered bigrams. For example, "the cat sat on the mat" should give:

[("cat","the"),("cat","sat"),("on","sat"),("on","the"),("mat","the")]

Each bigram is sorted alphabetically. This means that, for example, "to house to" gives [("house", "to"), ("house", "to")], which raises the frequency of these bigrams while reducing the search space.

I am able to get the above using:
unordered_bigrams = [tuple(sorted(n)) for n in list(nltk.bigrams(words))]
But I would now like to have a "bag-of-words" type vector for these.

I have ordered bigram feature vectors using:
o_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))

I would like the same for my unordered bigrams, but I'm struggling to find an option in CountVectorizer that provides this (I've looked at vocabulary and preprocessor without much luck).

score 1 · Answer 1 · edited May 23 '17 at 12:31

You don't really need a bigram generator if all you need are pairs of possible words given an unordered list of words:

>>> from itertools import permutations
>>> words = set("the cat sat on the mat".split())
>>> list(permutations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'on'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'on'), ('sat', 'the'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'on'), ('mat', 'the'), ('mat', 'sat'), ('mat', 'cat'), ('cat', 'on'), ('cat', 'the'), ('cat', 'sat'), ('cat', 'mat')]

Or, if you don't want duplicate tuples that contain the same words in a different order:

>>> from itertools import combinations
>>> list(combinations(words, 2))
[('on', 'the'), ('on', 'sat'), ('on', 'mat'), ('on', 'cat'), ('the', 'sat'), ('the', 'mat'), ('the', 'cat'), ('sat', 'mat'), ('sat', 'cat'), ('mat', 'cat')]

There's a good answer on product, combination and permutation on https://stackoverflow.com/a/942551/610569

Hi there, I have this already - I'm more concerned with converting these bigrams into feature vectors e.g. [[0,1,1,2,1,...],[...]] that can be used in classification models — charlotte75, Mar 09 '17 at 10:44
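
For the feature vectors asked about in the comment above, one possible approach is to emit the alphabetically sorted bigrams as strings and count them per document. The following is a minimal sketch using only the standard library, not a tested scikit-learn recipe. The helper names `unordered_bigrams` and `count_vectors` and the naive whitespace tokenizer are assumptions made for illustration.

```python
from collections import Counter


def unordered_bigrams(doc):
    """Return adjacent word pairs, each sorted alphabetically and joined by a space."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    words = doc.lower().split()
    return [" ".join(sorted(pair)) for pair in zip(words, words[1:])]


def count_vectors(docs):
    """Build a sorted vocabulary and one count vector per document."""
    counts = [Counter(unordered_bigrams(doc)) for doc in docs]
    # The vocabulary is every distinct unordered bigram seen in any document.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for bigrams that a document does not contain.
    vectors = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, vectors


docs = ["the cat sat on the mat", "to house to"]
vocabulary, vectors = count_vectors(docs)
print(vocabulary)
print(vectors)
```

With these two documents, `vectors` comes out as [[1, 1, 0, 1, 1, 1], [0, 0, 2, 0, 0, 0]] over the sorted vocabulary. Since scikit-learn documents that the `analyzer` parameter of CountVectorizer accepts a callable, passing `analyzer=unordered_bigrams` should give an equivalent sparse matrix, though that was not run here.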
