1

I've seen a number of questions on making histograms in clean one-liners, but I haven't yet found anyone trying to make them as efficiently as possible. I'm currently creating a lot of tfidf vectors for a search algorithm, and this involves creating a number of histograms and my current code, while being very short and readable is not as fast as I would like. Sadly, I've tried a number of other methods that turned out far slower. Can you do it faster? cleanStringVector is a list of strings (all lowercase, no punctuation), and masterWordList is also a list of words that should contain every word within the cleanStringVector.

from collections import Counter
def tfidfVector(cleanStringVector, masterWordList):
    frequencyHistogram = Counter(cleanStringVector)
    featureVector = [frequencyHistogram[word] for word in masterWordList]
    return featureVector

Worth noting that the fact that the Counter object returns a zero for non-existent keys instead of raising a KeyError is a serious plus and most of the histogram methods in other questions fail this test.

Example: If I have the following data:

["apple", "orange", "tomato", "apple", "apple"]
["tomato", "tomato", "orange"]
["apple", "apple", "apple", "cucumber"]
["tomato", "orange", "apple", "apple", "tomato", "orange"]
["orange", "cucumber", "orange", "cucumber", "tomato"]

And a master wordlist of:

["apple", "orange", "tomato", "cucumber"]

I would like a return of the following from each test case respectively:

[3, 1, 1, 0]
[0, 1, 2, 0]
[3, 0, 0, 1]
[2, 2, 2, 0]
[0, 2, 1, 2]

I hope that helps.

Approximate final results:

Original Method: 3.213
OrderedDict: 5.529
UnorderedDict: 0.190
Slater Victoroff
  • 21,376
  • 21
  • 85
  • 144
  • What does `cleanStringVector` look like? – chenaren May 23 '13 at 13:31
  • Oh, it's just a list of strings. Right now a straight python list, but assume it's a numpy array if you like. – Slater Victoroff May 23 '13 at 13:32
  • Have you benchmarked the methods [here](http://stackoverflow.com/questions/2870466/python-histogram-one-liner)? – Fredrik Pihl May 23 '13 at 13:35
  • I've run most of them, not all. I eliminated a few that were totally unreadable. This seems to be the fastest among them, but it's still pretty slow. – Slater Victoroff May 23 '13 at 13:37
  • What is the size of cleanStringVector (nr unique words) compared to masterWordList? If it is small why loop through the whole masterWordList? – RickyA May 23 '13 at 13:58
  • The size of cleanStringVector is pretty comparable to masterWordList (maybe 50%), and the reason I'm looping through the whole masterWordList is to feed into scipy's tfidf vector transform, but if there's a better way of getting the same information I would be totally willing to switch over. – Slater Victoroff May 23 '13 at 14:00
  • Why loop through masterWordList at all... What happens if you loop through cleanStringVector and upsert each word in a seperate result dict. – RickyA May 23 '13 at 14:00
  • I need a list of lists, where each list entry corresponds to the frequency of a specific word in the masterWordList. See tfidf weighting in the scipy docs to get a better idea of what my final output has to look like. http://scikit-learn.org/dev/modules/feature_extraction.html – Slater Victoroff May 23 '13 at 14:04
  • You should really include a small sample or subset of data with your question so that we may better help you. Also, what you would like the output to look like would be nice to know. – Inbar Rose May 23 '13 at 14:15
  • The actual data is currently proprietary, but I'll put up an example. – Slater Victoroff May 23 '13 at 14:17
  • 1
    Did you see [this](http://stackoverflow.com/questions/2522152/python-is-a-dictionary-slow-to-find-frequency-of-each-character) one? Although about letters the methods may be useful. – RickyA May 23 '13 at 14:17
  • Ooh, hadn't seen that. Checking to see which of these I can modify to work with words – Slater Victoroff May 23 '13 at 14:27
  • Hmm, it looked promising, but it appears that all of those performance boosts rely on the fact that there are a limited number of characters with a predefined order and a known, limited scope and size. Sadly I have none of these. Thought the problems would be a lot more similar than it seems they are. – Slater Victoroff May 23 '13 at 14:37

2 Answers2

2

This improves the runtime in my unrepresentative micro benchmark by 1 order of magnitude with Python 3:

mapping = dict((w, i) for i, w in enumerate(masterWordList))

def tfidfVector(cleanStringVector, masterWordList):    
    featureVector = [0] * len(masterWordList)
    for w in cleanStringVector:
        featureVector[mapping[w]] += 1
    return featureVector
Thomas Jung
  • 32,428
  • 9
  • 84
  • 114
  • Same idea I was trying to accomplish in my answer.. but much more elegant (and probably faster), nice – qwwqwwq May 23 '13 at 15:10
  • Oh, fantastic. Made some small tweaks for caching locality in the code calling it and managed to get about 20x speed improvement out of this. – Slater Victoroff May 23 '13 at 15:24
0

I think looping through the Master Word list is a problem. Each time you make a histogram you have to hash every word in the master word list (most of these hashes are just missing, a computationally expensive way to return a 0).

I would hash the master wordlist first, then use that hash to create each histogram, this way you only need to hash every word in the stringvector (twice, once to get the counts, and once to reset the master wordlist hash). If the stringvectors are smaller than the master wordlist, this results is many fewer hashing operations:

from itertools import repeat

stringvecs=[["apple", "orange", "tomato", "apple", "apple"],
["tomato", "tomato", "orange"],
["apple", "apple", "apple", "cucumber"],
["tomato", "orange", "apple", "apple", "tomato", "orange"],
["orange", "cucumber", "orange", "cucumber", "tomato"]]

m=["apple", "orange", "tomato", "cucumber"]

md = dict(zip(m, repeat(0)))

def tfidfVector(stringvec, md):
    for item in stringvec:
        md[item]+=1
    out=md.values()
    for item in stringvec:
        md[item]=0
    return out

for stringvec in stringvecs:
    print tfidfVector(stringvec, md)

Note: md.values() should be stable as long as we aren't adding keys..

qwwqwwq
  • 6,999
  • 2
  • 26
  • 49
  • I like the thought, but it's much slower than the current implementation. (Takes about twice as long) – Slater Victoroff May 23 '13 at 15:18
  • Yeah i just checked it out, using `OrderedDict` makes it 10x slower than using `dict` and is not necessary, testing without `OrderedDict` this is close Thomas Jung's answer but still a bit slower. – qwwqwwq May 23 '13 at 15:19