
I have a matrix (list of lists) where each column represents a unique word, each row represents a distinct document, and every entry is a 1 or 0, indicating whether or not the word for a given column exists in the document for a given row.

What I'd like to know is how to determine all the possible combinations of documents and words where two or more documents have two or more words in common. The result might look something like:

[ [[Docid_3, Docid_5], ['word1', 'word17', 'word23']], 
  [[Docid_3, Docid_9, Docid_334], ['word2', 'word7', 'word23', 'word68', 'word982']],
  ... ]

and so on for each possible combination. I'd love both a solution that provides the complete set of combinations and one that yields only the combinations that are not a subset of another, so, from the example, not [[Docid_3, Docid_5], ['word1', 'word17']] since it's a complete subset of the first example.

I feel like there is an elegant solution that just isn't coming to mind and the beer isn't helping.

Thanks.

ssweens
  • Your data structure is unclear to me. Do you really mean you have a 2D matrix of column headings? That would make the data structure itself 3D. Please show an example of the input data that we can use to run code on. – Dave Kirby Jun 12 '10 at 08:55
  • Your structure there is `[[], []], [[], []]`, which is _two_ lists of lists, that is an (implicit) tuple of lists of lists. Is that what you want? – badp Jun 12 '10 at 09:21
  • The data structure is flexible. It's just words (stemmed, lowered, etc) that map to the documents they came from. I landed on the matrix format described above as I was thinking some sort of binary matrix math might be the most efficient manner to solve it, but Alex's dicts of [doc_id] = set(word1, word2, ...., wordN) is fine if it provides the end result. – ssweens Jun 13 '10 at 22:18

4 Answers

  1. Normalize the text. You only want strings made of string.lowercase. Split/strip on everything else.
  2. Make sets out of this.
  3. Use something like this to get all possible groupings of all sizes:

    import itertools

    def get_all_lengths_combinations_of(elements):
        # yield every grouping of 2 or more elements, for all sizes
        for no_of_items in range(2, len(elements) + 1):
            for items in itertools.combinations(elements, no_of_items):
                yield items
    

    I'm sure the real itertools wizards will come up with something better, possibly involving izip().

  4. Remember you should be able to use the set.intersection() method like this:

    set.intersection(*list_of_sets_to_intersect)
    
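Putting the four steps together, here's a minimal sketch (the helper names normalize and common_word_groups are mine, not from the answer, and it assumes you start from the raw text of each document rather than from the 0/1 matrix):

import itertools
import string

def normalize(text):
    # step 1: keep only lowercased words, stripping surrounding punctuation
    words = (w.strip(string.punctuation) for w in text.lower().split())
    # step 2: make a set out of them
    return set(w for w in words if w)

def common_word_groups(doc_words, min_common=2):
    # doc_words maps docid -> set of words; steps 3 and 4:
    # intersect the word sets of every docid grouping of size >= 2
    for n in range(2, len(doc_words) + 1):
        for docids in itertools.combinations(sorted(doc_words), n):
            common = set.intersection(*[doc_words[d] for d in docids])
            if len(common) >= min_common:
                yield list(docids), sorted(common)

raw_docs = {'Docid_3': 'word1 word17 word23', 'Docid_5': 'Word1, word17: word23!'}
doc_words = dict((d, normalize(t)) for d, t in raw_docs.items())
print list(common_word_groups(doc_words))

Note that this enumerates every grouping, so it blows up combinatorially; it's only meant to show how the pieces fit.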
badp
  • ...duh :D Thank you so much @Sebastian, I was fitting so well in my complicator's gloves :D – badp Jun 12 '10 at 13:21
  • Thanks bp. Started to land on a similar solution while waiting for answers, but this is nice and clean. I found that this starts to get unruly fairly fast after doc counts get past 10's and word matches get past 2 or 3 unfortunately and just crushes my poor little VM. – ssweens Jun 13 '10 at 22:23
  • @ssweens: Well, to name a number `len(list(get_all_lengths_combinations_of(range(10))))` is 1013. What you can do to help yourself is to use those results to save on computation. For example, once you calculate X = A ∩ B, you can calculate A ∩ B ∩ C as X ∩ C. – badp Jun 14 '10 at 06:30
  • Thanks bp. Appreciate the assist. – ssweens Jun 16 '10 at 16:36

First, build a mapping from document ID to set of words -- your matrix of 0 and 1 is quite an unwieldy structure to process directly. If I read you correctly, the "column headings" (words) are the first list in the matrix (minus presumably the first item) and the "row headings" (docids) are the first items of each row (minus presumably the first row). Then (assuming Python 2.6 or better):

def makemap(matrix):
  im = iter(matrix)
  words = next(im)[1:]
  themap = {}
  for row in im:
    mapent = set()
    docid = row[0]
    for w, bit in zip(words, row[1:]):
      try:
        if bit: mapent.add(w)
      except:
        print 'w is %r' % (w,)
        raise
    themap[docid] = mapent
  return themap

Now you need to check all feasible subsets of documents -- the total number of subsets is huge so you really want to prune that search tree as much as you can, and brute-force generation of all subsets (e.g. by looping on itertools.combinations for various lengths) will not perform any pruning of course.

I would start with all 2-combinations (all pairs of docids -- itertools.combinations is fine for this of course) and make the first batch (those pairs which have 2+ words in common) of "feasible 2-length subsets". That can go in another mapping with tuples or frozensets of docids as the keys.

Then, to make the feasible (N+1)-length subsets, I would only try to extend existing feasible N-length subsets by one more docid each (checking the total intersection is still 2+ long of course). This at least does some pruning rather than blindly trying all the 2**N subsets (or even just the 2**N - N - 1 subsets of length at least two;-).

It might perhaps be possible to do even better by recording all docids that proved unable to extend a certain N-length subset -- no point in trying those against any of the (N+1)-length subsets derived from it. This is worth trying as a second level of pruning/optimization.

There may be further tweaks you could make for additional pruning, but none comes immediately to mind, so that's where I'd start from. (For readability, I'm not bothering below with micro-optimizations such as iteritems in lieu of items, frozensets in lieu of tuples, etc -- they're probably marginal given those sequences are all O(N) vs the exponential size of the computed structures, although of course worth trying in the tuning/optimizing phase).

def allfeasiblesubsets(matrix):
  mapping = makemap(matrix)
  docids = sorted(mapping.keys())
  feasible_len2 = {}
  dont_bother = dict((d, set([d])) for d in docids)
  for d1, d2 in itertools.combinations(docids, 2):
    commonw = mapping[d1].intersection(mapping[d2])
    if len(commonw) >= 2:
      feasible_len2[d1, d2] = commonw
    else:
      dont_bother[d1].add(d2)
      dont_bother[d2].add(d1)
  all_feasible = [feasible_len2]
  while all_feasible[-1]:
    feasible_Np1 = {}
    for ds, ws in all_feasible[-1].items():  
      md = max(ds)        
      for d, w in mapping.items():
        if d <= md or any(d in dont_bother[d1] for d1 in ds):
          continue
        commonw = w.intersection(ws)
        if len(commonw) >= 2:
          feasible_Np1[ds+(d,)] = commonw
    all_feasible.append(feasible_Np1)
  return all_feasible[:-1]

You'll notice I've applied only a mild form of my suggested "further pruning" -- dont_bother only records "incompatibilities" (<2 words in common) between one docid and others -- this may help if there are several pairs of such "incompatible docids", and is simple and reasonably unobtrusive, but is not as powerful in pruning as the harder "full" alternative. I'm also trying to keep all keys in the feasible* dicts as sorted tuples of docids (as the itertools.combinations originally provides for the pairs) to avoid duplications and therefore redundant work.

Here's the small example I've tried (in the same file as these functions after, of course, the import for itertools and collections):

mat = [ ['doc']+'tanto va la gatta al lardo che ci lascia lo zampino'.split(),
        ['uno', 0, 0, 0, 1, 0, 1, 0, 0, 0, 1],
        ['due', 1, 0, 0, 0, 0, 1, 0, 1, 0, 1],
        ['tre', 1, 0, 0, 0, 0, 0, 0, 1, 0, 1],
        ['qua', 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]]

mm = makemap(mat)
print mm
afs = allfeasiblesubsets(mat)
print afs

The results, which appear OK, are:

{'qua': set(['gatta', 'lo', 'ci', 'lardo']), 'tre': set(['lo', 'ci', 'tanto']), 'due': set(['lo', 'ci', 'lardo', 'tanto']), 'uno': set(['gatta', 'lo', 'lardo'])}
[{('due', 'tre'): set(['lo', 'ci', 'tanto']), ('due', 'uno'): set(['lo', 'lardo']), ('qua', 'uno'): set(['gatta', 'lo', 'lardo']), ('due', 'qua'): set(['lo', 'ci', 'lardo']), ('qua', 'tre'): set(['lo', 'ci'])}, {('due', 'qua', 'tre'): set(['lo', 'ci']), ('due', 'qua', 'uno'): set(['lo', 'lardo'])}]

but of course there might still be bugs lurking since I haven't tested it thoroughly. BTW, I hope it's clear that the result as supplied here (a list of dicts for various increasing lengths, each dict having the ordered tuple forms of the docids-sets as keys and the sets of their common words as values) can easily be post-processed into any other form you might prefer, such as nested lists.
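For instance, here's a minimal sketch of such post-processing (the helper name asnestedlists is mine, not part of the answer's code), flattening the afs result from the example above into the nested-list shape of the question:

def asnestedlists(all_feasible):
    # flatten the list-of-dicts result into [[list_of_docids, list_of_words], ...]
    result = []
    for perlength in all_feasible:
        for docids, words in perlength.items():
            result.append([list(docids), sorted(words)])
    return result

print asnestedlists(afs)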

(Not that it matters, but the text I'm using in the example is an old Italian proverb;-).

Alex Martelli
  • This is really good. Thank you, Alex. Will definitely find use for at least some of this. I'm still tinkering with it as it seems to fall down on my 1GB vm once I throw over 100+ documents (100 words or less) at it. Rough profiling looks to get overwhelmed in the "while all_feasible[-1]:" loop, namely the "for d, w in mapping.items():" segment after 5 or 6 doc commonality. Will try some additional pruning work to get speed up or tinkering with a bitarray solution as Denis suggests. Am thinking bitwise has better chance of getting perf to a 1000+ document scale. – ssweens Jun 13 '10 at 22:13

Take a look at SO what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there

For real problem sizes, say > 100 docs, 10000 words, get the nice bitarray module
(which says, by the way, "the same algorithm in Python ... is about 20 times slower than in C").

On "only the combinations that are not a subset of another": define a hit22 as a 2x2 submatrix with 11 11, a hit23 as a 2x3 submatrix with 111 111 (2 docs, 3 words in common), and so on.
A given hit22 may be in many hit2n s — 2 docs, n words,
and also in many hitn2 s — n docs, 2 words. Looks fun.
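For example, a hit22 finder might look something like this minimal sketch (the name hit22s is mine; it assumes each document is a bitarray with one bit per word, as in the script added below):

from itertools import combinations
from bitarray import bitarray  # http://pypi.python.org/pypi/bitarray

def hit22s(docs):
    # docs: list of bitarray rows, one per document, one bit per word
    # yield (jdoc, kdoc, common word indices) for every pair sharing >= 2 words
    for j, k in combinations(range(len(docs)), 2):
        common = docs[j] & docs[k]
        if common.count() >= 2:
            yield j, k, [w for w in range(len(common)) if common[w]]

docs = [bitarray('0110110'), bitarray('0110010'), bitarray('1010011')]
for j, k, words in hit22s(docs):
    print j, k, words

Growing a hit22 into a hit23 (one more word) or a hit32 (one more doc) is then a matter of AND-ing in another row or checking another column, much like extending the pairwise intersections above.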

Added Monday 14Jun: little functions using bitarray.
(Intro / python modules for real doc classification ? Dunno.)

""" docs-words with bitarray, randombits """
# google "document classification" (tutorial | python) ...
# https://stackoverflow.com/questions/1254627/what-tried-and-true-algorithms-for-suggesting-related-articles-are-out-there

from __future__ import division
import random
import sys
from bitarray import bitarray  # http://pypi.python.org/pypi/bitarray

__date__ = "14jun 2010 denis"

ndoc = 100
nbits = 1000

exec "\n".join( sys.argv[1:] )  # run this.py ndoc= ...
random.seed(1)
me = __file__.split('/') [-1]
print "%s  ndoc=%d  nbits=%d" % (me, ndoc, nbits)

    # bitarray stuff --

def bitslist( bits ):
    """ 011001 -> [1,2,5] """
    return [ j for j in range(len(bits)) if bits[j] ]

hex_01 = {
    "0": "0000", "1": "0001", "2": "0010", "3": "0011",
    "4": "0100", "5": "0101", "6": "0110", "7": "0111",
    "8": "1000", "9": "1001", "a": "1010", "b": "1011",
    "c": "1100", "d": "1101", "e": "1110", "f": "1111",
}

def to01( x, len_ ):
    x = "%x" % x
    s = "".join( hex_01[c] for c in x )
    return (len_ - len(s)) * "0"  + s

def randombits( nbits ):
    """ -> bitarray 1/16 1, 15/16 0 """
    hibit = 1 << (nbits - 1)
    r = (random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 )
        & random.randint( 0, hibit - 1 ))  # prob 1/16
    return bitarray( to01( r, nbits ))

#...............................................................................
doc = [ randombits(nbits) for j in range(ndoc) ]  # ndoc x nbits

def mostsimilarpair():
    """ -> (sim, j, k) most similar pair of docs """
    mostsim = (-1,-1,-1)
    for j in range(ndoc):
        for k in range(j+1, ndoc):
                # allpairs[j,k] -> scipy.cluster.hier ?
            sim = (doc[j] & doc[k]).count()  # nr bits (words) in common, crude
            mostsim = max( mostsim, (sim,j,k) )
    return mostsim

sim, jdoc, kdoc = mostsimilarpair()
print "The 2 most similar docs:" ,
print "doc %d has %d words," % ( jdoc, doc[jdoc].count() ) ,
print "doc %d has %d," % ( kdoc, doc[kdoc].count() )
print "%d words in common: %s" % ( sim, bitslist( doc[jdoc] & doc[kdoc] ))
print ""

#...............................................................................
def docslike( jdoc, thresh ):
    """ -> (doc index, sim >= thresh) ... """
    for j in range(ndoc):
        if j == jdoc:  continue
        sim = (doc[j] & doc[jdoc]).count()
        if sim >= thresh:
            yield (j, sim)

thresh = sim // 2
print "Docs like %d, with >= %d words in common:" % (jdoc, thresh)
for j, sim in docslike( jdoc, thresh ):
    print "%3d  %s" % ( j, bitslist( doc[j] & doc[jdoc] ))

"""
The 2 most similar docs: doc 72 has 66 words, doc 84 has 60,
12 words in common: [11, 51, 119, 154, 162, 438, 592, 696, 800, 858, 860, 872]

Docs like 72, with >= 6 words in common:
  2  [3, 171, 258, 309, 592, 962]
...
"""
denis
  • I'm following you, but get lost at the hit22, hit23... I think I skipped binary matrix math day in school... I get the concept and why 11 11, etc. but can't envision how one would take a bunch of bitarrays (one per doc) and implement a def match(hit22): function against those arrays... – ssweens Jun 13 '10 at 22:48
  • BTW - if you are truly from DE, nice thrashing of Australia today. – ssweens Jun 13 '10 at 23:11
  • Thanks a bunch, Denis. This is great stuff & helps. I grok it & am now working to factor the results down to the keyword/docid combo I describe above and will keep tinkering with what you provided. Looking to part learn/part solve a problem.. simple hier/kmeans/nnmf stuff I tried isn't cutting it. Too incl/exclusive, too much manual tweaking, etc. Looks like least common denom. words in sets of docs gives me keyword-based clusters that work good. Its just getting that list of elig. keyword sets & doc count/ids.. your hit22 approach sounds solid and cool/fun.. just don't know what to google:) – ssweens Jun 15 '10 at 17:39
  • And no, not Aussie.. US.. envious of their capacity for Foster's, but glad we have our football team. Hoping we don't end up playing Germany in the round of 16 after seeing that shellacking. – ssweens Jun 15 '10 at 17:49

How many documents? How many unique words? How much RAM do you have?

What do you want to produce in the following scenario: document A has words 1, 2, 3; B has 1, 2; C has 2, 3?

John Machin
  • Something like: [[a, b], [1, 2]], [[a,c][2, 3]] Don't care about the single doc or single words. Would like solution to handle 1000+ docs and could live with 100 words or less, averaging probably 30 unique words/doc. RAM -> whatever it takes... less the better. Can provide min of 1-2GB, no problemo. – ssweens Jun 13 '10 at 22:42