
Slightly modified duplicate strings in a list, when checking against a sorted list of strings.

I am checking strings representing content from some files against a list of certain strings. However, sometimes the same string can have an asterisk (*) appended to the end, resulting in slightly modified duplicates in this list.

Currently:

  # This is a very minimal code example:
  for _word in sorted(['Microsoft','Microsoft*']): 
      print(_word)

Desired:

  for _word in sorted(['Microsoft']):
      print(_word)

  # But still be able to check for 'Microsoft*' without having duplicates in the list.

Final solution:

if __name__ == '__main__':

    # Strings to check against, kept lowercase so tokens can be lowercased before the lookup.
    default_strings = sorted([
        'microsoft',
        'linux',
        'unix',
        'android'
    ])

    text = """
        Microsoft* is cool, but Linux is better.
    """

    # Split on whitespace, lowercase each token, and strip a trailing asterisk
    # so 'Microsoft*' still matches the plain 'microsoft' entry.
    for token in text.split():
        token = token.lower()
        if token.endswith('*'):
            token = token[:-1]

        if token in default_strings:
            print(token)

EDIT: If there's a better way, please let me know. Thanks a lot to everyone who participated and responded.

John Smith
  • filter special characters and use a set instead of a list. – BetaDev Mar 29 '19 at 17:16
  • Nope, sets don't work here. I have tested set() and {} dictionaries. – John Smith Mar 29 '19 at 17:17
  • yeah, but can you filter the special characters first and then cast it into a set? – BetaDev Mar 29 '19 at 17:30
  • Simply define your own `MySet` by extending the built-in `set` class and then modify the double underscore method so the duplicate check ignores special characters like the asterisk (*). Finally, cast your list into the `MySet` type; a rough sketch of this idea follows these comments. – BetaDev Mar 29 '19 at 17:44
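A rough sketch of that subclassing idea (the class name `NormalizedSet` and the rule of stripping a single trailing asterisk are assumptions for illustration, not code from this thread):

# Hypothetical sketch: a set subclass that normalizes items before storing
# or looking them up, so 'Microsoft' and 'Microsoft*' count as the same entry.
# Only __init__, add and __contains__ are covered here; other methods such as
# update() would need the same treatment.
class NormalizedSet(set):
    @staticmethod
    def _normalize(item):
        # Drop one trailing asterisk, if present.
        return item[:-1] if isinstance(item, str) and item.endswith('*') else item

    def __init__(self, iterable=()):
        super().__init__(self._normalize(i) for i in iterable)

    def add(self, item):
        super().add(self._normalize(item))

    def __contains__(self, item):
        return super().__contains__(self._normalize(item))

s = NormalizedSet(['Microsoft', 'Microsoft*'])
print(s)                  # one entry: NormalizedSet({'Microsoft'})
print('Microsoft*' in s)  # True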

1 Answer


Based on my comment: if you don't want to use a set, then

# Strip each filter character from the entries that contain it.
my_filters = ['*']
my_list = ['Microsoft', 'Microsoft*']
final_list = [x.replace(y, '') for x in my_list for y in my_filters if y in x]

But I would say: first filter the special characters in your input list and simply convert (cast) it into a set, as sketched below.
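A rough sketch of that "filter, then cast to a set" suggestion (assuming '*' is the only character to strip; the variable names are illustrative):

# Sketch: remove the filter characters from every entry, then deduplicate
# by casting to a set. Assumes '*' is the only special character involved.
my_filters = ['*']
my_list = ['Microsoft', 'Microsoft*']

cleaned = set()
for item in my_list:
    for ch in my_filters:
        item = item.replace(ch, '')
    cleaned.add(item)

print(cleaned)  # {'Microsoft'} -- the duplicate collapses once the asterisk is gone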

BetaDev
  • this isn't memory efficient, and it isn't performance efficient either. Sorry! – John Smith Mar 29 '19 at 17:22
  • Predefined lists at runtime are faster and more efficient, based on my benchmarking unit tests. – John Smith Mar 29 '19 at 17:24
  • hey, why don't you extend the built-in `set`, modify the double underscore method, and create your own comparison? Now you will have your own `set` data structure and you can simply cast your list into a set. I have to look at the `__` methods that we need to modify to ignore the duplicates. – BetaDev Mar 29 '19 at 17:42
  • Hi @webDev, I ended up applying direct string slicing in Python after verifying with string.endswith('*'); this approach is more effective and efficient. ;-) Thanks for the effort. Most of what you have suggested, I've already tested. I go to SO once I want input from others who might know something I didn't think of. – John Smith Mar 29 '19 at 17:45
  • @JohnSmith in what sense is this not efficient? The solution you proposed is just as efficient, except your `default_strings` should be a `set`, otherwise both of these approaches will be quadratic time. Edit: and why sort them? – juanpa.arrivillaga Mar 29 '19 at 18:00
  • Ah, I see what you were saying and what you want to achieve. I thought you wanted to remove duplicates from a list while ignoring the special characters. My bad. Haha. You have a whole different requirement and a whole new story there. – BetaDev Mar 29 '19 at 18:01
  • @juanpa.arrivillaga, it doesn't need to be a set, because the list contains only items I am interested in, thus no duplicates. In other words, it's a manually curated list, not a dynamically and programmatically generated one. – John Smith Mar 29 '19 at 18:05
  • yeah, but the *currently* and *desired code* sections that you posted with the original question are confusing. They don't include your original input string, which makes it difficult to understand your question. – BetaDev Mar 29 '19 at 18:09
  • @JohnSmith that doesn't matter. The purpose of a set isn't to remove duplicates; people just commonly use `set` (or dict) objects to remove duplicates, but the *main* use case is to give you fast, constant-time membership testing. Even with small lists, `set` membership tests are significantly faster. This will become increasingly impactful the larger the number of items you need to check becomes. – juanpa.arrivillaga Mar 29 '19 at 18:09
  • @juanpa.arrivillaga, I have tested converting a list of predefined strings to a set() and then used timeit, and it wasn't faster than a regular predefined list of strings; that's why I decided to go for the list approach. Regarding sort, I have always learned that a pre-sorted list is more efficient to check against vs. a non-sorted one, but in the case of Python I could be a bit wrong, so please feel free to correct me. AFAIK, sets in Python usually eliminate duplicates; I know this from debugging and testing. – John Smith Mar 29 '19 at 18:13
  • @JohnSmith converting to a `set` is a one-time, upfront cost. But if you are going to be checking membership *many times* then a set is *definitely more efficient*. Also, searching a sorted list is only faster **if you use binary search**, but you aren't doing that. Just use a set, dude, **this is exactly what they are for**. Also, if you can "manually define" your list, you can just do that with a set. Leave the creation time out of your timings. Again, `set` objects *do eliminate* duplicates, but that isn't their purpose; their *purpose* is **fast membership testing**. – juanpa.arrivillaga Mar 29 '19 at 18:14
  • @juanpa.arrivillaga, so you're suggesting I convert my current list in my 'final solution' above to a set, but keep everything else? I have read about lists and sets. I know they are different, dude. But, dude, when I did unit testing, the list [] loaded much faster than the set, despite them both having the same content. – John Smith Mar 29 '19 at 18:17
  • @JohnSmith https://gist.github.com/juanarrivillaga/c3bf1e727c146deb1af0357791189398 just look at the timings. Yes, a set will *initially cost more to load*, but **if you are going to be using the set for membership testing then it will be much much faster**. I don't think you understand the difference between these data structures; this is *the main use case of a set*. Also, any overhead from creating a set will be minimized if you are then going to sort the list. Note, checking for membership in a set is *constant*, like 25 nanoseconds; it doesn't matter if it has one or 1000 items. – juanpa.arrivillaga Mar 29 '19 at 18:19
  • @JohnSmith by the time you check for an item in a list with 1000 items, on the other hand, it is costing 12 microseconds, making the set about *500 times faster*. And the beauty is that *the list will keep getting slower as it gets bigger, but the set will take the same time no matter its size* (see the timing sketch after these comments). – juanpa.arrivillaga Mar 29 '19 at 18:21
  • @juanpa.arrivillaga, this is indeed useful testing. Thanks for your input. What about this: "Sets are significantly faster when it comes to determining if an object is present in the set (as in x in s), but are slower than lists when it comes to iterating over their contents.", from https://stackoverflow.com/questions/2831212/python-sets-vs-lists – John Smith Mar 29 '19 at 18:21
  • @JohnSmith yes, `set` objects **are there for checking membership**. That is their primary reason for being. If you need to iterate over something, a `list` will be the fastest built-in Python data structure for that. – juanpa.arrivillaga Mar 29 '19 at 18:22
  • @juanpa.arrivillaga Please feel free to update the answer based on your suggestions. I don't have as deep an understanding as you do. – BetaDev Mar 29 '19 at 18:25
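For reference, a small timing sketch along the lines of what juanpa.arrivillaga describes above (the 1000-item size, the probed value, and the repetition count are arbitrary choices for illustration; absolute numbers vary by machine):

# Rough sketch comparing membership tests on a list vs. a set with timeit.
# The list lookup cost grows with the number of items (worst case: the last
# element), while the set lookup stays roughly constant regardless of size.
import timeit

words = [f'word{i}' for i in range(1000)]
word_list = list(words)
word_set = set(words)

list_time = timeit.timeit("'word999' in word_list", globals=globals(), number=10000)
set_time = timeit.timeit("'word999' in word_set", globals=globals(), number=10000)

print(f'list membership: {list_time:.4f} s')
print(f'set membership:  {set_time:.4f} s')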