
I have a series of files, each containing word counts. Each file may contain different words. Here's an example:

FileA:

word1,20
word2,10
word3,2

FileB:

word1,10
word4,50
word3,5

There are about 20k files and each could have up to tens of thousands of words.

I ultimately want to build a sparse matrix where each row represents a file's word distribution, like what you'd get out of scikit-learn's CountVectorizer.

If word1, word2, word3, word4 are columns, and FileA and FileB are rows, then I would expect to get:

[[20,10,2,0],[10,0,5,50]]

How could I achieve that? If possible, I'd also like to be able to include only words that appear in at least N files.

  • There's a well-regarded answer to http://stackoverflow.com/questions/1938894/csv-to-sparse-matrix-in-python?rq=1. The N-file requirement is the tricky one, I think. Generate *two* matrices, one with word count and one with file count, and use the latter as a mask on the former later? You'd be able to adjust N relatively easily, which seems useful. – cphlewis Apr 14 '15 at 17:14
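
A minimal sketch of that masking idea, assuming the full counts matrix fits in memory as a NumPy array (the names and the value of N here are illustrative):

import numpy as np

counts = np.array([[20, 10, 2, 0],
                   [10, 0, 5, 50]])  # rows = files, columns = words

doc_freq = (counts > 0).sum(axis=0)  # number of files each word appears in
N = 2
filtered = counts[:, doc_freq >= N]  # keep columns for words in at least N files
# filtered is now [[20, 2], [10, 5]]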

1 Answer

You could use dictionaries: one mapping each word to the number of files it appears in, and one mapping each file name to that file's word counts.

import collections

files = ["FileA", "FileB"]
all_words = collections.defaultdict(int)   # word -> number of files it appears in
all_files = collections.defaultdict(dict)  # file name -> {word: count}

for filename in files:
    with open(filename) as f:
        for line in f:
            word, count = line.strip().split(",")
            all_files[filename][word] = int(count)
            all_words[word] += 1  # each word appears at most once per file

Then you can use those in a nested list comprehension to create the matrix as a list of lists (note that this is a dense representation, not an actual sparse matrix):

>>> [[all_files[f].get(w, 0) for w in sorted(all_words)] for f in files]
[[20, 10, 2, 0], [10, 0, 5, 50]]

Or, to keep only words that appear in more than one file (all_words holds the per-word file count, so compare against your N here):

>>> [[all_files[f].get(w, 0) for w in sorted(all_words) if all_words[w] > 1] for f in files]
[[20, 2], [10, 5]]
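
Since there are about 20k files and up to tens of thousands of words, a nested list of lists gets large; building an actual scipy sparse matrix may be preferable. Here is a sketch assuming the files, all_files, and all_words objects from above (min_files is an illustrative parameter for the at-least-N-files requirement):

from scipy.sparse import csr_matrix

min_files = 1  # keep words appearing in at least this many files
vocab = sorted(w for w in all_words if all_words[w] >= min_files)
col_index = {w: j for j, w in enumerate(vocab)}

rows, cols, data = [], [], []
for i, f in enumerate(files):
    for word, count in all_files[f].items():
        if word in col_index:
            rows.append(i)
            cols.append(col_index[word])
            data.append(count)

matrix = csr_matrix((data, (rows, cols)), shape=(len(files), len(vocab)))

With min_files = 1, matrix.toarray() reproduces the nested lists above; with min_files = 2, the word2 and word4 columns are dropped.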
– tobias_k