I would like to generate unordered bigrams. For example, "the cat sat on the mat" should give:
[("cat","the"),("cat","sat"),("on","sat"),("on","the"),("mat","the")]
Each bigram is sorted alphabetically, so, for example, "to house to" gives [("house", "to"), ("house", "to")].
This gives a higher frequency for these bigrams whilst shrinking the search space, because both word orders collapse into a single bigram.
I am able to get the above using:
import nltk
# words is a list of tokens; sort each pair so word order no longer matters
unordered_bigrams = [tuple(sorted(pair)) for pair in nltk.bigrams(words)]
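For anyone wanting to reproduce the example without NLTK, here is a minimal sketch of the same idea using only the standard library. It assumes the text is already split into tokens, and `sorted_bigrams` is a made-up helper name, not part of any library.

```python
def sorted_bigrams(words):
    """Return adjacent word pairs, each pair sorted alphabetically."""
    # zip(words, words[1:]) pairs every token with its successor
    return [tuple(sorted(pair)) for pair in zip(words, words[1:])]


# Assumption: plain whitespace tokenization is good enough for the example
words = "the cat sat on the mat".split()
pairs = sorted_bigrams(words)
print(pairs)
```

Both ("the", "cat") and ("cat", "the") come out as ("cat", "the"), which is the merging behaviour described above.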
But I would now like a "bag-of-words" style count vector for these.
I already build feature vectors for ordered bigrams using:
from sklearn.feature_extraction.text import CountVectorizer
o_bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
I would like the same for my unordered bigrams, but I'm struggling to find an option in CountVectorizer that gives me this processing (I've looked at the vocabulary and preprocessor parameters without much luck).
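To make the target concrete, here is a hand-rolled sketch of the output being asked for, built with `collections.Counter` rather than CountVectorizer. It assumes lowercasing and whitespace tokenization, and `sorted_bigram_counts`, `docs`, `vocabulary` and `vectors` are illustrative names only.

```python
from collections import Counter


def sorted_bigram_counts(text):
    """Count alphabetically sorted adjacent word pairs in one document."""
    words = text.lower().split()  # assumption: whitespace tokenization
    return Counter(tuple(sorted(pair)) for pair in zip(words, words[1:]))


docs = ["the cat sat on the mat", "to house to"]
counts = [sorted_bigram_counts(doc) for doc in docs]

# Shared vocabulary: every distinct unordered bigram, in sorted order
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary entry
vectors = [[c[bigram] for bigram in vocabulary] for c in counts]

for bigram, column in zip(vocabulary, zip(*vectors)):
    print(bigram, column)
```

CountVectorizer's `analyzer` parameter is documented as accepting a callable, so a function along these lines that returns the sorted pairs might be one way to plug this in. That route is untested here and should be treated as an assumption.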