
What is the most efficient way to compare two lists and only keep the elements that are in list A but not B for very large datasets?

Example:

words = ['shoe brand', 'car brand', 'smoothies for everyone', ...]
filters = ['brand', ...]
# Matching function
results = ['smoothies for everyone']

There have been somewhat similar questions, but I'm currently dealing with 1M+ words and filters, which overwhelms regular-expression approaches. I used to do a simple `filters[i] in words[j]` test with while loops, but this seems awfully inefficient.
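For reference, that baseline amounts to something like this (a sketch of the nested-loop substring test, not the original code; sample data from the example above):

words = ['shoe brand', 'car brand', 'smoothies for everyone']
filters = ['brand']
# Naive O(len(words) * len(filters)) substring test
results = [w for w in words if not any(f in w for f in filters)]
# results == ['smoothies for everyone']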

oliver13
  • Use a `set()`; the lookups will be faster. – Burhan Khalid Feb 05 '14 at 08:01
  • You say it *seems* inefficient, but *is* it? If it is, you can avoid matching regexps for *all* entries by using a hash function - only compare entries whose hashes match. – Michael Foukarakis Feb 05 '14 at 08:03
  • @BurhanKhalid: A set does not check whether a string *contains* a filter, it only does exact matches; see the example. – oliver13 Feb 05 '14 at 08:04
  • @mfukar: In this case I wouldn't even need regex - `' '+filter+' ' in ' '+word+' '` would do (sketched after these comments). Still, this takes forever, but if it's the most efficient I will have to deal with it :) – oliver13 Feb 05 '14 at 08:06
  • Oh, I'm sorry, I thought you meant complete matches. My comment obviously doesn't apply. – Michael Foukarakis Feb 05 '14 at 08:07
  • I would use precompiled regexes for the filters. Maybe you can post your code and we can try to find your bottleneck. – Jiri Feb 05 '14 at 08:12
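The padded-substring variant from the comments would look roughly like this (a sketch; it avoids partial-word matches such as 'smooth' matching 'smoothies', but still compares every word against every filter):

words = ['shoe brand', 'car brand', 'smoothies for everyone']
filters = ['brand']
# Pad both sides with spaces so only whole-word occurrences match
results = [w for w in words
           if not any(' ' + f + ' ' in ' ' + w + ' ' for f in filters)]
# results == ['smoothies for everyone']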

2 Answers


You can make `filters` a set:

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if all(i not in filters for i in w.split())]
['smoothies for everyone']

This works better than your `filters[i] in words[j]` test because it won't filter out "smoothies" when "smooth" is in the filter list.
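For example, with "smooth" as a filter (a quick illustration of that point; the data is made up for the demo):

>>> words = ['smoothies for everyone']
>>> [w for w in words if not any(f in w for f in ['smooth'])]  # substring test
[]
>>> [w for w in words if all(i not in {'smooth'} for i in w.split())]  # whole words
['smoothies for everyone']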

John La Rooy
  • Interesting approach, although that concern can be avoided by including spaces in the filters and words (`' '+word+' '`). I will test which one is faster! Thanks! – oliver13 Feb 05 '14 at 09:42

I tried a slightly modified version of @gnibbler's answer: it uses the set operation `intersection` instead of a per-token generator expression. I believe this version is a bit faster.

>>> words = ['shoe brand', 'car brand', 'smoothies for everyone']
>>> filters = {'brand'}
>>> [w for w in words if not set(w.split()).intersection(filters)]
['smoothies for everyone']
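A further variant worth benchmarking (an addition to the sketch above, not part of the timings discussed below): `set.isdisjoint` accepts any iterable, short-circuits on the first shared token, and avoids building an intermediate intersection set.

>>> [w for w in words if filters.isdisjoint(w.split())]  # stops at first shared token
['smoothies for everyone']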
Jiri
  • Do you have a benchmark for the set intersection actually being faster? I would assume that the set construction is a bit expensive (hashes need to be calculated). – Jonas Schäfer Feb 05 '14 at 08:26
  • I just tested both versions with `timeit.Timer(test).timeit()`; mine ran a bit faster across 3 runs, but that should be tested on a large dataset ... :( – Jiri Feb 05 '14 at 08:28
  • My idea was that explicit iteration is slower than a library function ... but yeah, you are correct too. – Jiri Feb 05 '14 at 08:30
  • I tested it against [the english San Francisco wikipedia article](https://en.wikipedia.org/wiki/San_Francisco), using the sentences (``words=f.read().lower().split('.')``) and ``filters={'city', 'bay'}``. The set version runs in 0.007 seconds per run and the list version in 0.009 seconds per run. [My test source](http://fpaste.org/74566/). Interestingly, with python2 the code is faster, but the relative difference (1:1.19 vs. 1:1.25) is approximately the same. – Jonas Schäfer Feb 05 '14 at 08:34
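A minimal timeit harness along the lines of the tests described in these comments (the sample data is illustrative, not the Wikipedia corpus from the linked test):

import timeit

setup = """
words = ['shoe brand', 'car brand', 'smoothies for everyone'] * 10000
filters = {'brand'}
"""
# gnibbler's version: per-token membership test
stmt_all = "[w for w in words if all(t not in filters for t in w.split())]"
# Jiri's version: set intersection
stmt_set = "[w for w in words if not set(w.split()).intersection(filters)]"

print(timeit.timeit(stmt_all, setup=setup, number=10))
print(timeit.timeit(stmt_set, setup=setup, number=10))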