How to obtain a list of documents which contain all words from a query? [not intersection]

Question

I'm working using Python.

I have a query, with words. For example, query=[hello, tree, blue]

For each word, I've found the documents it appears in, so I have one list per word, where each element is a document ID. Let's say:

list_query[0]=[1,4,5]
list_query[1]=[5,8]
list_query[2]=[4,5,8]

So, I should get a result = [5]

But, I don't wanna do it using intersection. I need to do it using iterations, i, j.

hello:
      i
      |
      1    4    5
tree:
      5    8
      |
      j

I'll have to start with i=0 and j=0 and compare list_query[0][i] with list_query[1][j]; if they're equal, I add that number to the result. If not, I should advance whichever iterator points at the smaller number, and so on. Then I repeat the process with the result of that intersection and the lists for the rest of the query words. But I can't figure out how to do it and it's driving me mad.

So if anyone could help me... Thanks in advance.

I see. [This thread](http://stackoverflow.com/questions/497338/efficient-set-intersection-algorithm) appears to be helpful. — georg, Oct 27 '12 at 15:31
`[i for i in listquery[0] if (i in listquery[1]) and (i in listquery[2])]` — Joel Cornett, Oct 27 '12 at 15:41

score 0 · Answer 1 · answered Oct 27 '12 at 15:45

Iteratively intersecting the contents of your lists isn't too hard if you're most interested in code simplicity, rather than performance:

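def intersect(query_lists):
    # initialize to first result set
    combined_results = query_lists[0]

    # keep only the values that also appear in each remaining result set;
    # a list is built at every step because chained lazy filter() objects
    # (Python 3) would all test against the last value of the loop variable
    for query in query_lists[1:]:
        combined_results = [i for i in combined_results if i in query]

    # return a new list so the caller's first list is not aliased
    return list(combined_results)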

This would be much faster using set instances rather than lists, but if you were using sets you could just use the built-in intersection methods and not bother doing it manually.

You could also achieve almost the same speedup by combining your lists, sorting the combined result and then scanning to find values that are duplicated exactly as many times as you had original lists. This won't work if there can ever be duplicates within one of your input lists though.
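A minimal sketch of that combine, sort and count idea, assuming no single list contains the same document ID twice. The name `intersect_by_counting` is made up here for illustration.

```python
from itertools import groupby


def intersect_by_counting(lists):
    """Return values that occur in every list (assumes no duplicates within a list)."""
    # Merge everything into one sorted sequence so equal values sit next to each other.
    combined = sorted(value for lst in lists for value in lst)

    # A value present in every list shows up exactly len(lists) times.
    return [
        value
        for value, run in groupby(combined)
        if len(list(run)) == len(lists)
    ]


list_query = [[1, 4, 5], [5, 8], [4, 5, 8]]
print(intersect_by_counting(list_query))  # [5]
```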

If you know that your input lists are each sorted, you can compare their first values: if they are all identical you have found a match, and otherwise you can reject any that are smaller than the largest one and compare again. It's probably not worth doing if your sub-lists are not sorted though, since you'll lose some of the performance benefits if you sort each sublist yourself.
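Here is one way the sorted-list idea might look, as a rough sketch rather than a tuned implementation. It assumes every list is sorted in ascending order with no repeated IDs inside a list; `intersect_sorted_heads` is a hypothetical name.

```python
def intersect_sorted_heads(lists):
    """Walk several ascending-sorted lists in parallel and collect common values."""
    if not lists:
        return []

    positions = [0] * len(lists)  # current index into each list
    out = []

    # Stop as soon as any list is exhausted: nothing further can be common.
    while all(pos < len(lst) for pos, lst in zip(positions, lists)):
        heads = [lst[pos] for pos, lst in zip(positions, lists)]
        largest = max(heads)

        if all(head == largest for head in heads):
            # Every list currently points at the same value: it is in all of them.
            out.append(largest)
            positions = [pos + 1 for pos in positions]
        else:
            # Heads smaller than the largest cannot appear in every list; skip them.
            positions = [
                pos + 1 if head < largest else pos
                for pos, head in zip(positions, heads)
            ]

    return out


print(intersect_sorted_heads([[1, 4, 5], [5, 8], [4, 5, 8]]))  # [5]
```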

score 0 · Accepted Answer · answered Oct 27 '12 at 16:00

I feel that you're already pretty far along. I could show you an implementation, but you've described the algorithm already, so implementing it yourself shouldn't be hard. But perhaps you don't feel confident in your description.

Let me restate your description in a slow and precise way, leaving out the information we don't need about queries and such. We have two pre-sorted lists, and we want to find their intersection. Adapting your diagram with a slightly fuller example, starting with list a = [1, 4, 5, 7, 8], b = [5, 8, 9], i=0, and j=0, as well as an empty output list out = [] which I'll leave out of the diagram initially...

i = 0
a = 1 4 5 7 8

j = 0
b = 5 8 9

First we check if a[i] and b[j] are equal. They aren't, so we advance the index that points at the smaller value. In this case, a[i] == 1 and b[j] == 5, so we want to increment i.

i =   1
a = 1 4 5 7 8

j = 0
b = 5 8 9

Going through the same steps, we increment i again:

i =     2
a = 1 4 5 7 8

j = 0
b = 5 8 9

Now things go differently; a[i] and b[j] are the same, so we want to append that value to the output list and increment both indices:

i =       3
a = 1 4 5 7 8

j =   1
b = 5 8 9

out = 5

Now we continue. a[i] is again less than b[j], so we increment i...

i =         4
a = 1 4 5 7 8

j =   1
b = 5 8 9

Now the values are the same again, so we add that value to out and increment i and j...

i =           5
a = 1 4 5 7 8

j =     2
b = 5 8 9

out = 5 8

But now we find that i == len(a). So we know the algorithm has terminated.

Now we have all we need to establish what variables we need and how the logic should work. We need list a, list b, index i, index j, and list out. We want to create a loop that stops when either i == len(a) or j == len(b), and within that loop, we want to test a[i] for equality with b[j]. If they are equal, we append a[i] to out and then increment both i and j. If they are not equal, then we test whether a[i] < b[j]. If it is, then we increment i; otherwise, we increment j.
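If you get stuck, the loop described above could be sketched roughly as follows. This is only one possible reading of the description; it assumes both lists are already sorted in ascending order, and `intersect_two` is just a name picked for this example.

```python
def intersect_two(a, b):
    """Intersect two ascending-sorted lists using two indices."""
    i = j = 0
    out = []

    # Stop as soon as either index runs off the end of its list.
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same value in both lists: record it, then advance both indices.
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # a[i] is too small to match anything left in b.
            i += 1
        else:
            # b[j] is too small to match anything left in a.
            j += 1

    return out


print(intersect_two([1, 4, 5, 7, 8], [5, 8, 9]))  # [5, 8]
```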

This determines the intersection between two lists; now we just have to apply this to the first and the second list, and then apply the result to the third list, and then apply the result of that to the fourth, and so on.
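The chaining step might look something like the sketch below. `intersect_all` is a hypothetical helper that takes whatever two-list routine you end up writing as its second argument; `naive_pair` is only a stand-in so the example runs on its own, and your i/j loop would replace it.

```python
def intersect_all(lists, intersect_two):
    """Fold a two-list intersection function over any number of lists."""
    if not lists:
        return []

    result = list(lists[0])
    for other in lists[1:]:
        if not result:
            # Nothing survived so far; later lists cannot add anything back.
            break
        result = intersect_two(result, other)
    return result


def naive_pair(a, b):
    # Stand-in for the two-index loop: keeps values of a that also occur in b.
    return [x for x in a if x in b]


list_query = [[1, 4, 5], [5, 8], [4, 5, 8]]
print(intersect_all(list_query, naive_pair))  # [5]
```

Because each intermediate result of a sorted-list intersection is itself sorted, it can be fed straight back in as the first argument for the next list.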

@LaFeeVerte, if you need more information, let me know. It just seems to me that you'll learn more if you can figure out the details on your own -- but I'm happy to answer additional questions. — senderle, Oct 28 '12 at 02:22
