
I have a text file with 20 million names, one name per line. Now I want to, for example, find the name "peter" in that list.

  1. My first approach was MongoDB via PyMongo with an index, which worked well. However, I wanted the fastest possible search.

  2. Since the list is static and never changes, I thought it should be possible to load the list into a variable and iterate over it. However, compared to MongoDB it is very slow: around 1 s per search.

     def initList(self):
         global list
         data = []
         with open('list.txt', 'r') as f:
             for x in f:
                 x = x.strip()
                 data.append(x)

         list = [i for i, _ in enumerate(data)]
         print("All names loaded.")

Then the code for the search:

  global list

  if name in list or surname in list:
    print(x)

Now my question is: am I missing something, or why is the list approach so slow? And what would be the fastest possible solution?

My next step will be multiprocessing.

    An indexed binary search over 20M items usually takes about 25 steps, an unindexed search 10M steps on average. – Klaus D. Mar 07 '21 at 13:29
  • List membership testing is a linear search, because a list can contain anything in any order; finding the last element of a long list, for example, will take forever. You need to either sort the list first and then use a binary search, or change it to a dict or set, which by the way they are implemented offer very fast membership testing. – Copperfield Mar 07 '21 at 14:21
  • Which part of your code have you timed? I suppose it is the "if name in list or surname in list:" part which is slow? The "in list" part of the code is usually slow, especially with big lists. – Malo Mar 07 '21 at 14:39
  • Where is the MongoDB data and the query? – prasad_ Mar 08 '21 at 04:57

2 Answers


Your approach requires checking names one by one, which takes O(n) comparisons. Databases use indexes: they keep all names alphabetically ordered, which makes a much faster O(log n) search possible. They compare the provided name with the middle element, instantly know which half of the elements definitely cannot contain it, and then repeat this on the remaining half. This reduces the number of comparisons dramatically. Python offers the dict data structure (and set), which uses a different approach, hashing, but is also very effective. You can find some details here on StackOverflow.
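For illustration, here is a minimal sketch of both ideas: binary search over a sorted list via the bisect module, and hash-based lookup via a set. It assumes the names come from the question's list.txt, and contains_sorted is a hypothetical helper, not a library function:

    import bisect

    # Load the names once, as in the question (one name per line).
    with open('list.txt', 'r') as f:
        names = [line.strip() for line in f]

    # Option 1: binary search over a sorted list, O(log n) comparisons.
    names.sort()

    def contains_sorted(sorted_names, name):
        # bisect_left returns the leftmost insertion point; the name
        # is present only if that position holds an equal value.
        i = bisect.bisect_left(sorted_names, name)
        return i < len(sorted_names) and sorted_names[i] == name

    # Option 2: hash-based membership via a set, O(1) on average.
    name_set = set(names)

    print(contains_sorted(names, 'peter'))  # binary search
    print('peter' in name_set)              # hash lookup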

Multiprocessing will not give a big advantage compared with a database search that uses an index.
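To check where the time actually goes, one could time both kinds of membership test directly with timeit. A self-contained sketch using synthetic stand-in data (the real file has 20 million names):

    import timeit

    # Hypothetical stand-in data instead of the real 20M-name file.
    names = ["name%d" % i for i in range(2_000_000)]
    name_set = set(names)

    # One full linear scan of the list vs. a thousand hash lookups in the set.
    print(timeit.timeit(lambda: "peter" in names, number=1))
    print(timeit.timeit(lambda: "peter" in name_set, number=1000))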


Isn't this line of code for indexing? list = [i for i, _ in enumerate(data)]

I thought that a solution without a database should be faster, because the variable is in RAM?