
What is the most efficient way to compare two lists and only keep the elements that are in list A but not B for very large datasets?

Example:

words = ['shoe brand', 'car brand', 'smoothies for everyone', ...]
filters = ['brand', ...]
# Matching function
results = ['smoothies for everyone']

There have been somewhat similar questions, but I'm currently dealing with 1M+ words and filters, which overwhelms regular-expression approaches. I used to do a simple `filters[i] in words[j]` test with while loops, but this seems awfully inefficient.
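For reference, that baseline amounts to something like this (a sketch of the nested-loop substring test, not the original code; sample data from the example above):

words = ['shoe brand', 'car brand', 'smoothies for everyone']
filters = ['brand']
# Naive O(len(words) * len(filters)) substring test
results = [w for w in words if not any(f in w for f in filters)]
# results == ['smoothies for everyone']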

oliver13
  • Use a `set()`; the lookups will be faster. – Burhan Khalid Feb 05 '14 at 08:01
  • You say it *seems* inefficient, but *is* it? If it is, you can avoid matching regexps for *all* entries by using a hash function - only compare entries whose hashes match. – Michael Foukarakis Feb 05 '14 at 08:03
  • @BurhanKhalid: A set does not check whether a string *contains* a filter, it only does exact matches; see the example. – oliver13 Feb 05 '14 at 08:04
  • @mfukar: In this case I wouldn't even need regex - `' '+filter+' ' in ' '+word+' '` would do (sketched after these comments). Still, this takes forever, but if it's the most efficient I will have to deal with it :) – oliver13 Feb 05 '14 at 08:06
  • Oh, I'm sorry, I thought you meant complete matches. My comment obviously doesn't apply. – Michael Foukarakis Feb 05 '14 at 08:07
  • I would use precompiled regexes for the filters. Maybe you can post your code and we can try to find your bottleneck. – Jiri Feb 05 '14 at 08:12
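The padded-substring variant from the comments would look roughly like this (a sketch; it avoids partial-word matches such as 'smooth' matching 'smoothies', but still compares every word against every filter):

words = ['shoe brand', 'car brand', 'smoothies for everyone']
filters = ['brand']
# Pad both sides with spaces so only whole-word occurrences match
results = [w for w in words
           if not any(' ' + f + ' ' in ' ' + w + ' ' for f in filters)]
# results == ['smoothies for everyone']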

2 Answers


You can make `filters` a set:

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if all(i not in filters for i in w.split())]
['smoothies for everyone']

This works better than your `filters[i] in words[j]` test because it won't filter out "smoothies" when "smooth" is in the filter list.
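For example, with "smooth" as a filter (a quick illustration of that point; the data is made up for the demo):

>>> words = ['smoothies for everyone']
>>> [w for w in words if not any(f in w for f in ['smooth'])]  # substring test
[]
>>> [w for w in words if all(i not in {'smooth'} for i in w.split())]  # whole words
['smoothies for everyone']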

John La Rooy
  • Interesting approach, although that concern can be avoided by including spaces in the filters and words (`' '+word+' '`). I will test which one is faster! Thanks! – oliver13 Feb 05 '14 at 09:42

I tried a slightly modified version of @gnibbler's answer: it uses the set operation `intersection` instead of a per-token generator expression. I believe this version is a bit faster.

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if not set(w.split()).intersection(filters)]
['smoothies for everyone']
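A further variant worth benchmarking (an addition to the sketch above, not part of the timings discussed below): `set.isdisjoint` accepts any iterable, short-circuits on the first shared token, and avoids building an intermediate intersection set.

>>> [w for w in words if filters.isdisjoint(w.split())]  # stops at first shared token
['smoothies for everyone']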
Jiri
  • Do you have a benchmark for the set intersection actually being faster? I would assume that the set construction is a bit expensive (hashes need to be calculated). – Jonas Schäfer Feb 05 '14 at 08:26
  • I just tested both versions with `timeit.Timer(test).timeit()`; mine ran a bit faster across 3 runs, but that should be tested on a large dataset ... :( – Jiri Feb 05 '14 at 08:28
  • My idea was that explicit iteration is slower than a library function ... but yeah, you are correct too. – Jiri Feb 05 '14 at 08:30
  • I tested it against [the english San Francisco wikipedia article](https://en.wikipedia.org/wiki/San_Francisco), using the sentences (``words=f.read().lower().split('.')``) and ``filters={'city', 'bay'}``. The set version runs in 0.007 seconds per run and the list version in 0.009 seconds per run. [My test source](http://fpaste.org/74566/). Interestingly, with python2 the code is faster, but the relative difference (1:1.19 vs. 1:1.25) is approximately the same. – Jonas Schäfer Feb 05 '14 at 08:34
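A minimal timeit harness along the lines of the tests described in these comments (the sample data is illustrative, not the Wikipedia corpus from the linked test):

import timeit

setup = """
words = ['shoe brand', 'car brand', 'smoothies for everyone'] * 10000
filters = {'brand'}
"""
# gnibbler's version: per-token membership test
stmt_all = "[w for w in words if all(t not in filters for t in w.split())]"
# Jiri's version: set intersection
stmt_set = "[w for w in words if not set(w.split()).intersection(filters)]"

print(timeit.timeit(stmt_all, setup=setup, number=10))
print(timeit.timeit(stmt_set, setup=setup, number=10))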