0

I have a set of sentences in a file ( say 500 ). I am trying to find whether a pair of words ( say word1 and word2 ) is present in any of the sentences. I have 58000 such pairs of words.

For example, let the set of sentences be:

I am a good boy. He is a bad boy. I am a very good boy.

Pair of words to search:

am, good

So this should return the first and last sentence as output.

I am using the following regex:

for match in re.finditer(r'([ A-Za-z0-9]*)\b{string1}\b([^\.!?]*)\b{string2}\b([^\.!?]*[\.!?])'.format(string1=word1, string2=word2), sentence_set.lower(), re.S):

This statement is doing the work but taking a lot of time; more than 8 minutes.

Then I removed the regex part and used multiple loops and split each sentence, then checked whether the 2 words are present or not. This took much less time, less than 2 minutes.

So, I felt that regex is very slow at sometimes. Is that true ? Is there any way to improve the speed ?

user
  • 5,370
  • 8
  • 47
  • 75
AKA
  • 5,479
  • 4
  • 22
  • 36
  • The complexity of your pattern already speaks volumes. Plus the patterns are compiled on the fly rather than pre-compiled – Moses Koledoye Jul 18 '16 at 09:27
  • 8
    This is not a Python-specific problem; you have a [catastrophic backtracking problem](http://www.regular-expressions.info/catastrophic.html) in your regex; limit those greedy quantifiers! – Martijn Pieters Jul 18 '16 at 09:28
  • @MosesKoledoye My logic is to search for the pair of words between all sentences. Is there any other easier pattern for that ? – AKA Jul 18 '16 at 09:33
  • 1
    *"the patterns are compiled on the fly rather than pre-compiled"*- not entirely true, Python caches recently used patterns per e.g. the note on https://docs.python.org/2/library/re.html#re.compile – jonrsharpe Jul 18 '16 at 09:35
  • @jonrsharpe The insertions with `.format` does not update the pattern? I assumed OP is changing those with an outer loop. – Moses Koledoye Jul 18 '16 at 09:38
  • Yes, they do, but if the pattern matches an existing one it is looked up rather than being compiled from scratch. – jonrsharpe Jul 18 '16 at 09:38
  • Can you please provide a string that takes so much time? Try also `((?:(?!\bam\b).)*)\bam\b((?:(?!\bgood\b)[^.!?])*)\bgood\b([^.!?]*[.!?])` – Wiktor Stribiżew Jul 18 '16 at 09:50
  • Or maybe [`([^a.?!]*(?:(?:\Ba|a(?!m\b))[^a.?!]*)*)\bam\b([^g?!.]*(?:(?:\Bg|g(?!ood\b))[^g.?!]*)*)\bgood\b([^.!?]*[.!?])`](https://regex101.com/r/iH9yE9/1) – Wiktor Stribiżew Jul 18 '16 at 10:29
  • @WiktorStribiżew I want to search almost 58000 pairs of words in 500 sentences! That makes too much time. – AKA Jul 18 '16 at 10:32
  • Split out sentences, and search. – Wiktor Stribiżew Jul 18 '16 at 10:33
  • Run each of the codes (the regex based and the `for ... in ...` based with a profiler (e.g, cProfile module or line_profiler) and you'd know which is faster on _your data_. – boardrider Jul 18 '16 at 14:25
  • If you don't need to extract the sentences that match--or if you split it first, as @Wiktor suggested--you can use a much simpler regex: `\b{string1}\b[^.!?]*\b{string2}\b`. But I really think you need a whole different approach, like indexing the text first, or using [Duncan's approach](http://stackoverflow.com/a/38435677/20938). – Alan Moore Jul 18 '16 at 17:23

2 Answers2

4

You say you have 500 sentences and 58000 word pairs which means you intend to create 58000 different regular expressions to run against the sentences and most of the searches will match nothing.

A better approach by far would be to create a dict mapping each word that occurs in a word pair to a set of all the other words it can pair with.

Then for each take each sentence in turn, split it up into words, test each word in turn for membership of the dictionary and if it is found get the intersection of the other words in the sentence with the set you created of words that pair with it.

Duncan
  • 92,073
  • 11
  • 122
  • 156
4

You've got to keep in mind that the better way to do a thing is to use the right tool. Regex are good for (complex) pattern matching where you can't use method like word1 in sentence because you're looking for a pattern and not a finite string.

Some will says that regex is faster, others will says that string operations are faster. They'll both be right and wrong.

Here's a graph in favor of string manipulation:

enter image description here

And here's a question on SO in favor of regex: Which one is faster? Regex or EndsWith?


You are trying to find a word in a sentence, do no over-complicate (even if you find regex sexy), use in. Keep in mind the KISS principle and that if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.

Community
  • 1
  • 1
Thomas Ayoub
  • 29,063
  • 15
  • 95
  • 142