-2

I am a programming newbie and I was playing with some big data (yelp dataset 6m+ reviews) and for a simple example I wanted to try to find a certain word in a text so basically i tried to for loop through all this data and find a certain word in these reviews and before getting triggered I know that this is the worst way of doing that but then I used nltk to preprocess the data, put the reviews in a list and check for the word inside this list using the "in" keyword and it was much faster so my question is what makes the "in" keyword faster ? And is there an even faster way other than improving the preprocessing part ?

Edit 1(here is an example code):

First I tokenize the review e.x "This place is good" becomes ["This","place","is","good"]

contents = word_tokenize(data[u'text'])

Then I check if a certain string is in this list

if(contents[i] in list_of_targeted_words): return 1

This appeared to be faster than using for loop

if(contents[i] == list_of_targeted_words[j]): return 1
alvas
  • 115,346
  • 109
  • 446
  • 738
Essam Gouda
  • 125
  • 1
  • 12
  • Welcome to Stack Overflow. In order to help you understand how some code works, we need to first see the code. Please show us a short example that illustrates what you are asking about and we can help from there. – Code-Apprentice Mar 06 '19 at 21:27
  • Related: https://stackoverflow.com/questions/19775692/use-and-meaning-of-in-in-an-if-statement Note that as this link explains, `in` works differently for different types of data. – Code-Apprentice Mar 06 '19 at 21:28
  • 2
    It's not searching within the reviews but it's comparing the word with the whole reviews that are stored in the list. The internal workings of `a in b` depend very much on `type(b)`. For your list it comes down to roughly `any(a == x for x in b)`. – a_guest Mar 06 '19 at 21:28
  • @Code-Apprentice I added a sample code – Essam Gouda Mar 06 '19 at 21:42

1 Answers1

4

in itself is not faster; it's just the public face of the __contains__ method that a class can define. list.__contains__ is an O(n) operation because it has to walk through the entire list looking for a match.

# a in my_list
for x in my_list:
    if x == a:
        return True
return False

dict.__contains__ is fast because it just needs to perform the O(1) lookup in a dict value.

# a in my_dict
try:
     my_dict[a]
     return True
 except KeyError:
     return False

Other classes can define __contains__ as needed. Consider a class representing a binary search tree:

class BST:
    # You should be able to infer enough of the structure of the tree
    # from this definition.
    def __contains__(self, x):
        node = self.root
        while node is not None:
            if node.value == x:
                return True
            elif node.value < x:
                node = node.right
            else:
                node = node.left
        return False

Most importantly, it doesn't do an O(n) traversal of the entire tree looking for x; instead, it does a O(lg n) walk down from the root.

chepner
  • 497,756
  • 71
  • 530
  • 681
  • I think I can try the dict._contains_ method and check the differences, thank you for your clarification – Essam Gouda Mar 06 '19 at 21:46
  • What do you mean "try"? Either you have a `dict` or you don't; you can't apply `dict.__contains__` to an object that isn't a `dict`. – chepner Mar 06 '19 at 21:49
  • Well I am thinking to create the dictionary using the reviews first instead of creating a list – Essam Gouda Mar 06 '19 at 21:53
  • Note I am registering this as an answer because when I used a dictionary to search instead a list the code is much faster now thanks for the help ! If anyone is having any similar problem please let me know I will share the code with you. – Essam Gouda Mar 12 '19 at 10:07