I have a series of files, each one containing counts of words. Each file could have different words. Here's an example:
FileA:
word1,20
word2,10
word3,2
FileB:
word1,10
word4,50
word3,5
There are about 20k files and each could have up to tens of thousands of words.
I ultimately want to build a sparse matrix where each row represents one file's word counts, like what you'd get out of scikit-learn's CountVectorizer.
If word1, word2, word3, word4 are the columns, and FileA and FileB are the rows, then I would expect to get:
[[20,10,2,0],[10,0,5,50]]
How could I achieve that? If possible, I'd also like to be able to include only words that appear in at least N files.
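To make the target concrete, here is a minimal pure-Python sketch of the kind of thing I have in mind. It is not a tested solution at the 20k-file scale. The helper names (`read_counts`, `build_triplets`, `to_dense`) and the `min_files` parameter are made up for illustration, and the file contents are inlined as lists of lines rather than read from disk. It counts how many files each word appears in, keeps the words that reach the threshold (similar in spirit to CountVectorizer's `min_df`), and emits (row, column, value) triplets that a sparse-matrix constructor could consume.

```python
import csv
from collections import Counter


def read_counts(lines):
    """Parse 'word,count' lines into a {word: count} dict (hypothetical helper)."""
    return {row[0]: int(row[1]) for row in csv.reader(lines) if row}


def build_triplets(docs, min_files=1):
    """Turn a list of {word: count} dicts into COO-style triplets.

    Only words present in at least `min_files` documents get a column.
    """
    # Document frequency: number of files each word appears in.
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())

    # Assign column indices in sorted word order so the layout is deterministic.
    kept = sorted(w for w, n in doc_freq.items() if n >= min_files)
    vocab = {word: j for j, word in enumerate(kept)}

    rows, cols, data = [], [], []
    for i, counts in enumerate(docs):
        for word, count in counts.items():
            j = vocab.get(word)
            if j is not None:
                rows.append(i)
                cols.append(j)
                data.append(count)
    return rows, cols, data, vocab


def to_dense(rows, cols, data, n_rows, n_cols):
    """Expand triplets into a dense list of lists (only for checking small cases)."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, v in zip(rows, cols, data):
        dense[i][j] = v
    return dense


# The two example files from above, inlined as lists of lines.
file_a = ["word1,20", "word2,10", "word3,2"]
file_b = ["word1,10", "word4,50", "word3,5"]
docs = [read_counts(file_a), read_counts(file_b)]

rows, cols, data, vocab = build_triplets(docs)
print(to_dense(rows, cols, data, len(docs), len(vocab)))

# Keep only words that appear in at least 2 files (word1 and word3 here).
rows2, cols2, data2, vocab2 = build_triplets(docs, min_files=2)
print(to_dense(rows2, cols2, data2, len(docs), len(vocab2)))
```

If SciPy is available, I assume the triplets could be passed on as `scipy.sparse.csr_matrix((data, (rows, cols)), shape=(len(docs), len(vocab)))` instead of being expanded to a dense list.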