
I'm working on a continuously learning, focused web crawler that finds news articles related to specific crisis and tragedy events around the world. I am currently working on making the data model as lean and efficient as possible, given that it grows constantly as the crawl continues.

I am storing the data model in a list (to do TF-IDF comparisons against the page being crawled), and I want to reduce the size of the list without losing the relative counts of the items in it.

This is a sample model from 2 crawled webpages:

[[u'remark', u'special', u'agent', u'richard', u'deslauri', u'press', u'investig', u'crime', u'terror', u'crime', u'inform', u'servic', u'inform', u'laboratori', u'servic', u'want', u'want', u'want', u'terror', u'crime', u'want', u'news', u'news', u'press', u'news', u'servic', u'crime', u'inform', u'servic', u'laboratori', u'servic', u'servic', u'crime', u'crime', u'crime', u'terror', u'boston', u'press', u'remark', u'special', u'agent', u'richard', u'deslauri', u'press', u'investig', u'remark', u'special', u'agent', u'richard', u'deslauri', u'press', u'investig', u'boston', u'special', u'agent', u'remark', u'richard', u'deslauri', u'boston', u'investig', u'time', u'time', u'investig', u'boston', u'terror', u'law', u'enforc', u'boston', u'polic', u'polic', u'alreadi', u'alreadi', u'law', u'enforc', u'around', u'evid', u'boston', u'polic', u'evid', u'laboratori', u'evid', u'laboratori', u'may', u'alreadi', u'laboratori', u'investig', u'boston', u'polic', u'law', u'enforc', u'investig', u'around', u'alreadi', u'around', u'investig', u'law', u'enforc', u'evid', u'may', u'time', u'may', u'may', u'investig', u'may', u'around', u'time', u'investig', u'investig', u'boston', u'boston', u'news', u'press', u'boston', u'want', u'boston', u'want', u'news', u'servic', u'inform'], [u'2011', u'request', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'govern', u'crime', u'crime', u'inform', u'servic', u'inform', u'servic', u'nation', u'want', u'ten', u'want', u'want', u'crime', u'want', u'news', u'news', u'press', u'releas', u'news', u'stori', u'servic', u'crime', u'inform', u'servic', u'servic', u'servic', u'crime', u'crime', u'crime', u'news', u'press', u'press', u'releas', u'2011', u'request', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'govern', u'2011', u'request', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'govern', u'2013', u'nation', u'press', u'tamerlan', u'tsarnaev', u'dzhokhar', u'tsarnaev', u'tamerlan', u'tsarnaev', u'dzhokhar', u'tsarnaev', u'dzhokhar', u'tsarnaev', u'tamerlan', u'tsarnaev', u'dzhokhar', u'tsarnaev', u'2011', u'foreign', u'govern', u'inform', u'tamerlan', u'tsarnaev', u'inform', u'2011', u'govern', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'foreign', u'govern', u'2011', u'inform', u'foreign', u'govern', u'nation', u'press', u'releas', u'crime', u'releas', u'ten', u'news', u'stori', u'2013', u'ten', u'news', u'stori', u'2013', u'ten', u'news', u'stori', u'2013', u'2011', u'request', u'inform', u'tamerlan', u'tsarnaev', u'foreign', u'govern', u'nation', u'press', u'releas', u'want', u'news', u'servic', u'inform', u'govern']]

I want to maintain the list of words and not embed the count into the list itself. I would like the list to go from:

[Boston, Boston, Boston, Bombings, Bombings, Tsarnaev, Tsarnaev, Time] to [Boston, Boston, Bombings, Tsarnaev]

Basically, if I had a list [a,a,a,b,b,c], I would want to reduce it to [a,a,b]

EDIT: Sorry for not being clear; I will try again. I do not want a set. The number of occurrences is very important because this is a weighted list, so "Boston" should appear more times than "time" or another similar term. What I am trying to accomplish is to minimize the data model while removing the insignificant terms from it. In the example above, I purposely left out c because it adds too much "fat" to the model. I want to keep the relative weights, in that a appeared 1 more time than b and 2 more times than c, but since c only appeared once in the original model, it is removed from the lean model.
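To make the reduction concrete, here is a rough sketch in terms of counts (purely illustrative; the variable names are made up):

counts = {'a': 3, 'b': 2, 'c': 1}    # counts in the original model [a, a, a, b, b, c]
least = min(counts.values())         # 1, the count of the rarest term 'c'
lean_counts = {'a': 2, 'b': 1}       # subtract least from every count; 'c' falls to 0 and is dropped
lean_list = ['a', 'a', 'b']          # the reduced list I actually want to keep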

wilco
  • what? In your example, you are losing `c`? I don't get the logic – karthikr May 14 '13 at 21:44
  • Why not just store the item and the actual count? `((a,3), (b,2), (c, 1))` – Marcin May 14 '13 at 21:46
  • Have a look at http://docs.python.org/2/library/collections.html#collections.Counter – Nisan.H May 14 '13 at 21:48
  • I think it was a typo (the `c` missing) – Youcha May 14 '13 at 21:50
  • I purposely left out the C in the above example. The logic is: find value with the least # of occurrences = 'x', reduce 'x' to 0 or 1, and reduce all other values (count = count - 'x') – wilco May 15 '13 at 02:37
  • Reduce the count of everything else by the count of the least-occurring element? So `[a, a, a, a, b, b, b, c, c] -> [a, a, b]`? Under what circumstances does x go to 1 rather than 0? – jscs May 15 '13 at 02:46
  • If the number of occurrences is above a certain threshold, say X, I want to deem it significant enough to keep so then I would like it to go to 1 rather than 0 – wilco May 15 '13 at 02:58
  • And then the counts of everything else are reduced by x-1 rather than x? So if X is 1, `[a, a, a, a, a, b, b, b, b, c, c, c, d, d] -> [a, a, a, a, b, b, b, c, c, d]`? – jscs May 15 '13 at 03:02
  • @JoshCaswell - precisely – wilco May 15 '13 at 04:16

4 Answers

from collections import defaultdict
d = defaultdict(int)
for w in words[0]:              # count occurrences of each word in the first page's word list
    d[w] += 1
mmin = min(d[p] for p in d)     # count of the least frequent word

Then you can subtract this mmin from each word's count and create a new list. But perhaps the dict is compact enough on its own. To preserve the order, you can use the counts in the dict and devise some smart way to filter your initial word list.

For example, for the word list [a,a,a,b,b,c], the dictionary will contain {a: 3, b: 2, c: 1} and mmin = 1. You can use this information to get a slimmer dictionary by subtracting 1 from every count, giving {a: 2, b: 1}; since c drops to 0, it is removed.

Complete code:

from collections import defaultdict
d = defaultdict(int)
words = ['a', 'a', 'a', 'b', 'b', 'c']
for w in words:
    d[w] += 1                   # build the word -> count dictionary
mmin = min(d[p] for p in d)     # count of the least frequent word
slim = []
for w in words:
    if d[w] > mmin:             # keep an occurrence only while its remaining count exceeds mmin
        slim.append(w)
        d[w] -= 1
print slim                      # ['a', 'a', 'b']
perreal

If your sample model is assigned to a variable topics, then you can use collections.Counter to maintain, for each topic, a dictionary-like object of its terms and their counts:

from collections import Counter
topic_count = [Counter(topic) for topic in topics]
# [Counter({u'boston': 11, u'investig': 11, u'crime': 7, u'servic': 7, u'want': 6, u'press': 6, u'laboratori': 5, u'may': 5, u'news': 5, u'agent': 4, u'alreadi': 4, u'deslauri': 4, u'special': 4, u'richard': 4, u'polic': 4, u'terror': 4, u'around': 4, u'evid': 4, u'law': 4, u'remark': 4, u'inform': 4, u'enforc': 4, u'time': 4}),
#  Counter({u'tsarnaev': 13, u'inform': 12, u'govern': 9, u'tamerlan': 9, u'foreign': 8, u'news': 8, u'crime': 8, u'2011': 7, u'servic': 7, u'press': 6, u'releas': 5, u'want': 5, u'ten': 4, u'request': 4, u'stori': 4, u'nation': 4, u'2013': 4, u'dzhokhar': 4})]
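
If you need the flat word lists again later (for the TF-IDF comparison, say), here is a quick sketch of expanding them back out of the counters; word_lists is just an illustrative name, and topic_count comes from the snippet above:

# Counter.elements() repeats each term as many times as its count,
# so only the counts need to be stored in between.
word_lists = [list(c.elements()) for c in topic_count]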
mVChr
  • I want to maintain the list of words and not embed the count into the list itself. I would like the list to go from [Boston, Boston,Boston,Bombings,Bombings,Tsarnaev,Tsarnaev,Time] to [Boston,Boston,Bombings,Tsarnaev] – wilco May 15 '13 at 02:40

This seems like a "normalization" (rather than "reduction") task to me, although I'm not certain that's exactly the correct term.

I think collections.Counter is indeed what you want to use here. It has a couple of handy methods that make changing the number of items and getting the results very easy.

An instance can be created directly from a list, counting the occurrences of each item. Counter.most_common() gives a list of the key/count pairs, sorted from greatest frequency to least, so the lowest count is the second field of the last tuple in that list.

Counter.subtract() is the linchpin here: passed a list containing the same keys as the existing Counter instance, it reduces the count of each key by the number of times it appears in the new list. To create this list, use a list comprehension that repeats each key a number of times equal to the count of the least frequent key (adjusting for your requirement that if that count is over a certain threshold, the final result should keep one occurrence of that key). The nested list comprehension is just my favorite way of flattening a list -- the repeats of the keys are initially created as their own lists.

Finally, Counter.elements() will give you a list just like the one you started with: each key appears a number of times equal to its count.

from collections import Counter

def normalize_list(L, threshold):
    cntr = Counter(L)
    least_count = cntr.most_common()[-1][1]
    if least_count > threshold:
        least_count -= 1
    cntr.subtract([item for k in cntr.keys() for item in [k] * least_count])
    return list(cntr.elements())

>>> a, b, c, d, e = 'abcde'
>>> normalize_list([a, a, a, a, a, b, b, b, b, c, c, c, d, d], 10)
['a', 'a', 'a', 'c', 'b', 'b']

>>> normalize_list(your_list, 6)
[u'laboratori', u'releas', u'want', u'want', u'want', u'want', u'want', u'want', u'want', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'crime', u'boston', u'boston', u'boston', u'boston', u'boston', u'boston', u'boston', u'2011', u'2011', u'2011', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'tsarnaev', u'investig', u'investig', u'investig', u'investig', u'investig', u'investig', u'investig', u'may', u'govern', u'govern', u'govern', u'govern', u'govern', u'press', u'press', u'press', u'press', u'press', u'press', u'press', u'press', u'news', u'news', u'news', u'news', u'news', u'news', u'news', u'news', u'news', u'tamerlan', u'tamerlan', u'tamerlan', u'tamerlan', u'tamerlan', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'servic', u'foreign', u'foreign', u'foreign', u'foreign', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform', u'inform']

This doesn't preserve the order of your original list, of course.
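
If order did matter, one possible variant (just a sketch; normalize_list_ordered is an illustrative name, and it reuses the Counter import above) would compute the reduced counts the same way and then walk the original list, keeping each word only while its reduced count is positive:

def normalize_list_ordered(L, threshold):
    cntr = Counter(L)
    least_count = cntr.most_common()[-1][1]
    if least_count > threshold:
        least_count -= 1
    # reduced count for every key; keys that fall to 0 are dropped entirely
    remaining = dict((k, v - least_count) for k, v in cntr.items())
    kept = []
    for item in L:
        if remaining[item] > 0:
            kept.append(item)
            remaining[item] -= 1
    return kept

>>> normalize_list_ordered([a, a, a, a, a, b, b, b, b, c, c, c, d, d], 10)
['a', 'a', 'a', 'b', 'b', 'c']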

jscs

At the risk of oversimplifying, it seems that `leanmodel = model[::2]` will usually do approximately what you want (in the example, it does exactly what you want).

Edit: Now knowing that ordering is unimportant, I add: you should first sort the model. This is easy to encapsulate in Python 3 or Python 2.7 -- `leanmodel = sorted(model)[::2]` -- and slightly less straightforward in earlier versions of Python 2.

In general this is a 1-d nearest-neighbour sampling problem (which is why `leanmodel = model[::2]` may be accurate enough).
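
For example, a quick sketch using the toy list from the question:

model = ['a', 'a', 'a', 'b', 'b', 'c']
lean = sorted(model)[::2]    # take every second element of the sorted list
print lean                   # ['a', 'a', 'b'], the reduction asked for in the question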

kampu
  • can you provide documentation on this notation? I am not familiar with it. You have sparked my interest with the pythonic simplicity of the statement :) – wilco May 15 '13 at 04:11
  • `>>> model = [1,2,3,1,2,1]` `>>> lean = model[::2]` `>>> print lean` gives `[1, 3, 2]`; this is not the desired result – wilco May 15 '13 at 04:15
  • 1
    @wilco : that means that you don't want ordering preserved, which is important information you should add to the post. Just sort the list beforehand, then. (I've edited my answer with a more detailed explanation.) – kampu May 15 '13 at 06:41
  • @wilco: In response to your first comment, you want to look up [slice notation](http://stackoverflow.com/questions/509211/the-python-slice-notation). Given that you accepted the 'normalization' based solution, also consider updating the OP to reflect that you wanted a normalization rather than a resampling. – kampu May 15 '13 at 06:48