I have written a piece of code that compares two lists of strings (list A and list B) and, for each string in list A, finds the closest match in list B. I am asking this question because the code has been running for over 8 hours with no sign of finishing. I should mention that I am using the Levenshtein distance from the fuzzywuzzy library. Each list has around 50,000 elements. I don't know much about time complexity since I'm a freshman. Any help would be appreciated. Link to code: https://github.com/ammadakram/Environmetrics-project/blob/main/levenshtein.py
-
Would it be possible to test the code with smaller lists first? You should also have a sense of how your code scales with the size `N` of your datasets. This should be knowable by perhaps carefully reading the documentation for `fuzzywuzzy` or reading about the algorithmic details of how it works. – sodiumnitrate Jul 23 '22 at 19:13
-
I did test with smaller lists. In my tests, list A had 1,000 elements and list B had 3,000. It took around 2 minutes at most. – amad akram Jul 23 '22 at 19:16
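For a rough sense of scale: comparing every string in A against every string in B costs one comparison per pair, so the work grows with `len(A) * len(B)`. The sketch below extrapolates from the small test above, assuming the runtime is proportional to the number of pairs and taking "around 2 mins" as 120 seconds. The function name `extrapolate_seconds` is made up for illustration.

```python
def extrapolate_seconds(small_a, small_b, small_seconds, big_a, big_b):
    """Scale a measured runtime by the ratio of pairwise comparisons.

    Assumes the cost per (A, B) pair stays roughly constant, which is
    only an approximation (string lengths and caching effects vary).
    """
    seconds_per_pair = small_seconds / (small_a * small_b)
    return seconds_per_pair * big_a * big_b


# Small test: 1,000 x 3,000 pairs in about 120 seconds.
# Full run: 50,000 x 50,000 pairs.
estimate = extrapolate_seconds(1_000, 3_000, 120, 50_000, 50_000)
print(f"Estimated full run: {estimate / 3600:.1f} hours")
```

Under those assumptions the full run is about 833 times more work than the test, so a run that is still going after 8 hours is not necessarily stuck.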
-
What you could also do is make your script verbose as it processes the lists so that you know where it's at (the easiest way would be printing the index). Looking at your code, it seems like `get_matches()` is the function that does the core job and gets called iteratively, you can time it [like this](https://stackoverflow.com/questions/7370801/how-do-i-measure-elapsed-time-in-python?noredirect=1&lq=1). To speed up the process, check out [this post](https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column)! – AgentBilly Jul 23 '22 at 19:45
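A minimal sketch of the progress-and-timing idea, using `time.perf_counter` from the standard library. The `get_matches` below is only a hypothetical stand-in (the real function's signature in the linked script may differ), and `run_with_progress` and `report_every` are names invented for this example.

```python
import time


def get_matches(query, choices):
    # Hypothetical stand-in for the function in the linked script:
    # it just picks the choice whose length is closest to the query's.
    return min(choices, key=lambda c: abs(len(c) - len(query)))


def run_with_progress(list_a, list_b, report_every=1000):
    """Match every item of list_a against list_b, printing progress."""
    start = time.perf_counter()
    results = []
    total = len(list_a)
    for i, query in enumerate(list_a, start=1):
        results.append(get_matches(query, list_b))
        if i % report_every == 0 or i == total:
            elapsed = time.perf_counter() - start
            # Naive estimate: assume remaining items cost the same on average.
            remaining = elapsed / i * (total - i)
            print(f"{i}/{total} done, {elapsed:.1f}s elapsed, "
                  f"~{remaining:.1f}s remaining")
    return results


matches = run_with_progress(["ab", "abcd"], ["x", "wxyz"], report_every=1)
```

Printing every `report_every` items rather than on every iteration keeps the output readable and the overhead negligible.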
-
Does this answer your question? [How do I measure elapsed time in Python?](https://stackoverflow.com/questions/7370801/how-do-i-measure-elapsed-time-in-python) – itprorh66 Jul 23 '22 at 19:54
-
You should probably replace fuzzywuzzy with [RapidFuzz](https://github.com/maxbachmann/RapidFuzz) when working with large datasets (just replace the import with `from rapidfuzz import process`), which should significantly improve your runtime. – maxbachmann Jul 24 '22 at 23:49
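A minimal sketch of what the swap could look like, assuming the script only needs the single best match per query. `process.extractOne` and `fuzz.ratio` are RapidFuzz calls; the helper `best_match` and the sample strings are made up for illustration. The `difflib` fallback is there only so the sketch runs without RapidFuzz installed; it is not a Levenshtein metric and is not a substitute for the library.

```python
import difflib

try:
    from rapidfuzz import fuzz, process
except ImportError:  # RapidFuzz not installed; use the stdlib fallback below
    process = None


def best_match(query, choices):
    """Return the choice most similar to query (choices must be non-empty)."""
    if process is not None:
        # extractOne returns a (choice, score, index) tuple for list input.
        match, _score, _index = process.extractOne(
            query, choices, scorer=fuzz.ratio
        )
        return match
    # Fallback only: difflib similarity, slower and a different metric.
    return difflib.get_close_matches(query, choices, n=1, cutoff=0.0)[0]


list_b = ["aple pie", "banana", "cherry tart"]
print(best_match("apple pie", list_b))
```

The number of comparisons is unchanged (still one per pair of strings), so the gain comes from each comparison being much cheaper.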
-
@maxbachmann Yes, that definitely helped a lot. Thank you so much. – amad akram Jul 25 '22 at 20:02