6

I have Python application.

There is list of 450 prohibited phrases. There is message got from user. I want to check, does this message contain any of this prohibited pharases. What is the fastest way to do that?

Currently I have this code:

message = "sometext"
lista = ["a","b","c"]

isContaining = false

for a, member in enumerate(lista):
 if message.contains(lista[a]):
  isContaining = true
  break

Is there any faster way to do that? I need to handle message (max 500 chars) in less than 1 second.

TN888
  • 7,659
  • 9
  • 48
  • 84
  • `isContaining = any(x in message for x in lista)` – falsetru Jan 05 '15 at 14:25
  • 1
    First of all, remove the `enumerate` and `a,` part. Replace `lista[a]` with `member`. Second, is the `in` function not faster? So something like `if member in message:` – Mathias711 Jan 05 '15 at 14:25
  • 1
    Membership testing is faster on [a `set`](https://docs.python.org/2/library/stdtypes.html#set-types-set-frozenset). – jonrsharpe Jan 05 '15 at 14:26
  • I think you would be better to compile a regex and test the text on it – shevski Jan 05 '15 at 14:26
  • 1
    1 second is a long time, does this really take longer? – RemcoGerlich Jan 05 '15 at 14:27
  • @RemcoGerlich I am not sure. Only telling you I should really low answer-time – TN888 Jan 05 '15 at 14:28
  • @shevski: You think using a giant regex would be faster than using a set? Really? – PM 2Ring Jan 05 '15 at 14:29
  • @PM2Ring: you'd need a set that contains all possible substrings of message. – RemcoGerlich Jan 05 '15 at 14:31
  • @RemcoGerlich: Good point. I guess you could have a set of all the significant words in the prohibited phrases, test the message to see if it has any of those significant words and only then do more thorough testing to see if it has the full prohibited phrases. – PM 2Ring Jan 05 '15 at 14:36
  • @PM2Ring: sounds like more work than testing whether it has any of the phrases. Anyway a compiled regex for this should blow any Python for loop out of the water. – RemcoGerlich Jan 05 '15 at 14:37
  • @RemcoGerlich: Perhaps (I just modified my comment, BTW). I'm assuming that only a small percentage of the messages use prohibited words / phrases, so quickly eliminating thorough testing of each message ought to speed things up. I guess. – PM 2Ring Jan 05 '15 at 14:40

4 Answers4

9

There is the any built-in function specially for that:

>>> message = "sometext"
>>> lista = ["a","b","c"]
>>> any(a in message for a in lista)
False
>>> lista = ["a","b","e"]
>>> any(a in message for a in lista)
True

Alternatively you could check the intersection of the sets:

>>> lista = ["a","b","c"]
>>> set(message) & set(lista)
set([])
>>> lista = ["a","b","e"]
>>> set(message) & set(lista)
set(['e'])
>>> set(['test','sentence'])&set(['this','is','my','sentence'])
set(['sentence'])

But you won't be able to check for subwords:

>>> set(['test','sentence'])&set(['this is my sentence'])
fredtantini
  • 15,966
  • 8
  • 49
  • 55
3

Using regex compile from list

Consider memory and building time or expression, compile in advance.

lista = [...]
lista_escaped = [re.escape(item) for item in lista]
bad_match = re.compile('|'.join(lista_escaped))
is_bad = bad_match.search(message, re.IGNORECASE)
Community
  • 1
  • 1
shevski
  • 1,012
  • 1
  • 10
  • 32
  • 3
    Combine with re.escape() for fixed string searches. re.compile will create a neatly optimized state machine for scanning strings. – Yann Vernier Jan 05 '15 at 14:33
1

I would combine the any builtin with the in operator:

isContaining = any(a in message for a in lista)

I don't know if this is the fastest way but it seems the simplest to me.

Elmar Peise
  • 14,014
  • 3
  • 21
  • 40
0

We can also use set intersection method

>>> message = "sometext"
>>> lista = ["a","b","c"]
>>> isContaining = False
>>> if set(list(message)).intersection(set(lista)):
...    isContaining = True
... 
>>> isContaining
False
>>> message = "sometext a"
>>> list(message)
['s', 'o', 'm', 'e', 't', 'e', 'x', 't', ' ', 'a']
>>> if set(list(message)).intersection(set(lista)):
...    isContaining = True
... 
>>> isContaining
True
Vivek Sable
  • 9,938
  • 3
  • 40
  • 56