15

I have a list of strings:

In [53]: l = ['#Trending', '#Trending', '#TrendinG', '#Yax', '#YAX', '#Yax']

In [54]: set(l)
Out[54]: {'#TrendinG', '#Trending', '#YAX', '#Yax'}

I want to have a case-insensitive set of this list.

Expected Result:

Out[55]: {'#Trending', '#Yax'}

How can I achieve this?

Yax

6 Answers

28

If you need to preserve case, you could use a dictionary instead. Case-fold the keys, then extract the values to a set:

 set({v.casefold(): v for v in l}.values())

The str.casefold() method uses the Unicode case folding rules (pdf) to normalize strings for case-insensitive comparisons. This is especially important for non-ASCII alphabets and for text with ligatures. Examples are the German ß (sharp S), which is normalised to ss, and, from the same language, the ſ (long s), which is normalised to s:

>>> print(s := 'Waſſerſchloß', s.lower(), s.casefold(), sep=" - ")
Waſſerſchloß - waſſerſchloß - wasserschloss
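
To make the difference concrete for deduplication, here is a minimal sketch (the sample words are made up for illustration) showing that lower() can leave two spellings of the same word in the set, while casefold() collapses them:

```python
# Hypothetical sample data: three spellings of the same German word.
words = ['STRASSE', 'straße', 'Straße']

# str.lower() keeps 'ß' as-is, so 'strasse' and 'straße' stay distinct.
lowered = {w.lower() for w in words}

# str.casefold() maps 'ß' to 'ss', so all three collapse to one entry.
folded = {w.casefold() for w in words}

print(len(lowered), len(folded))  # prints "2 1"
```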

You can encapsulate this into a class.

If you don't care about preserving case, just use a set comprehension:

{v.casefold() for v in l}

Note that Python 2 doesn't have this method; use str.lower() in that case.

Demo:

>>> l = ['#Trending', '#Trending', '#TrendinG', '#Yax', '#YAX', '#Yax']
>>> set({v.casefold(): v for v in l}.values())
{'#Yax', '#TrendinG'}
>>> {v.lower() for v in l}
{'#trending', '#yax'}
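
Note that the dictionary comprehension keeps the *last* spelling seen for each folded key, which is why the demo yields '#TrendinG' rather than the '#Trending' shown in the question. If the first spelling should win instead, one possible sketch (the name first_seen is just illustrative) uses dict.setdefault():

```python
l = ['#Trending', '#Trending', '#TrendinG', '#Yax', '#YAX', '#Yax']

# setdefault() only stores a value when the folded key is not present yet,
# so the first spelling encountered for each key is the one that is kept.
first_seen = {}
for v in l:
    first_seen.setdefault(v.casefold(), v)

result = set(first_seen.values())
print(sorted(result))  # prints "['#Trending', '#Yax']"
```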

Wrapping the first approach into a class would look like:

try:
    # Python 3
    from collections.abc import MutableSet
except ImportError:
    # Python 2
    from collections import MutableSet

class CasePreservingSet(MutableSet):
    """String set that preserves case but tests for containment by case-folded value

    E.g. 'Foo' in CasePreservingSet(['FOO']) is True. Preserves case of *last*
    inserted variant.

    """
    def __init__(self, *args):
        self._values = {}
        if len(args) > 1:
            # str.format() rather than an f-string, so that the module
            # still parses on Python 2
            raise TypeError(
                "{} expected at most 1 argument, got {}".format(
                    type(self).__name__, len(args))
            )
        values = args[0] if args else ()
        try:
            self._fold = str.casefold  # Python 3
        except AttributeError:
            self._fold = str.lower     # Python 2
        for v in values:
            self.add(v)

    def __repr__(self):
        return '<{}{} at {:x}>'.format(
            type(self).__name__, tuple(self._values.values()), id(self))

    def __contains__(self, value):
        return self._fold(value) in self._values

    def __iter__(self):
        try:
            # Python 2
            return self._values.itervalues()
        except AttributeError:
            # Python 3
            return iter(self._values.values())

    def __len__(self):
        return len(self._values)

    def add(self, value):
        self._values[self._fold(value)] = value

    def discard(self, value):
        try:
            del self._values[self._fold(value)]
        except KeyError:
            pass

Usage demo:

>>> cps = CasePreservingSet(l)
>>> cps
<CasePreservingSet('#TrendinG', '#Yax') at 1047ba290>
>>> '#treNdinG' in cps
True
Martijn Pieters
  • Bonus point for presenting a structured solution to this, encapsulated in a class proper. – Per Lundberg Nov 09 '18 at 13:59
  • This is a good start, but it has several issues: 1) I think `__init__(self, *values)` is a poor init signature for the class, `__init__(self, iterable)` would be more natural 2) there are still missing a tonne of methods, e.g. `update`, which should be implemented 3) casefold should only be attempted on strings, if you add e.g. an int it raises 4) the implementation detail leaks in a way that causes the class to not be sufficiently set like, e.g. `cps | {"a"}` crashes out with a bizarre error `TypeError: descriptor 'casefold' for 'str' objects doesn't apply to a 'generator' object`. – wim Apr 01 '20 at 23:28
  • @wim: you have got some points there, I'll see if I can adjust. `update` is not part of the `MutableSet` protocol, only of `set`, and this was meant as a starting point, really. Also see [this bug report](https://bugs.python.org/issue23161). In the same vein this type **only supports strings**. – Martijn Pieters Apr 07 '20 at 20:53
  • @wim: I can't reproduce your TypeError, `cps | {"a"}` **works just fine**; producing ``. Are you sure you didn't accidentally use `{a}` where `a` was referencing a generator object? – Martijn Pieters Apr 07 '20 at 21:07
  • Here's the reproducer: https://repl.it/@wimglenn/CaseInsensitiveSet – wim Apr 07 '20 at 22:10
  • @wim Ah! I inadvertently have fixed that by making the class accept a single iterable rather than `*values`. Try the updated version: https://repl.it/@mjpieters/CaseInsensitiveSet :-) – Martijn Pieters Apr 07 '20 at 22:30
  • @MartijnPieters Yes, I think that's the correct fix - and actually my reason for point 1). [Some methods in the base assumes such an init signature](https://github.com/python/cpython/blob/9205520d8c43488696d66cbdd9aefbb21871c508/Lib/_collections_abc.py#L485-L486). – wim Apr 07 '20 at 22:43
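
The points raised in these comments can be illustrated with a compact, Python 3 only sketch. It is not the code from either linked repl; the names FoldedSet and _key are hypothetical. It assumes a single-iterable constructor (which the MutableSet mixin operators such as | rely on), folds only str values so that other hashable items pass through unchanged, and adds an update() convenience method that is not part of the MutableSet protocol:

```python
from collections.abc import MutableSet


class FoldedSet(MutableSet):
    """Hypothetical case-preserving set; keeps the last spelling added."""

    def __init__(self, iterable=()):
        # A single-iterable signature lets the mixin operators (|, &, ...)
        # build new instances by passing in a generator of elements.
        self._values = {}
        for v in iterable:
            self.add(v)

    @staticmethod
    def _key(value):
        # Only strings are case-folded; other hashables are used as-is.
        return value.casefold() if isinstance(value, str) else value

    def __contains__(self, value):
        return self._key(value) in self._values

    def __iter__(self):
        return iter(self._values.values())

    def __len__(self):
        return len(self._values)

    def add(self, value):
        self._values[self._key(value)] = value

    def discard(self, value):
        self._values.pop(self._key(value), None)

    def update(self, iterable):
        # Convenience method mirroring set.update() for one iterable.
        for v in iterable:
            self.add(v)


cps = FoldedSet(['#Yax', '#YAX'])
union = cps | {'a', 42}          # handled by the Set mixin's __or__
print(type(union).__name__, len(union))  # prints "FoldedSet 3"
```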
3

You can use lower():

>>> set(i.lower() for i in l)
set(['#trending', '#yax'])
Mazdak
2

You could convert the entire list to lowercase before creating a set.

l = map(lambda s: s.lower(), l)
set(l)
Stepan Grigoryan
2

Create a case-insensitive set class of your own.

class CaseInsensitiveSet(set):

    def add(self, item):
        try:
            set.add(self, item.lower())
        except Exception:                # not a string
            set.add(self, item)

    def __contains__(self, item):
        try:
            return set.__contains__(self, item.lower())
        except Exception:
            return set.__contains__(self, item)

    # and so on... other methods will need to be overridden for full functionality
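
One of the methods that needs overriding is the constructor: set.__init__ does not call an overridden add(), so with only the two methods above, CaseInsensitiveSet(['A', 'a']) would still contain two items. A hedged Python 3 sketch of a slightly fuller version follows; the _normalize helper is a hypothetical name, and operators such as | or & are still not covered:

```python
class CaseInsensitiveSet(set):
    """Sketch of a lowercasing set; only a few methods are overridden."""

    def __init__(self, iterable=()):
        # Start empty and route every item through our own add().
        super().__init__()
        self.update(iterable)

    @staticmethod
    def _normalize(item):
        # Lowercase strings only; leave other hashable items untouched.
        return item.lower() if isinstance(item, str) else item

    def add(self, item):
        super().add(self._normalize(item))

    def update(self, *iterables):
        for iterable in iterables:
            for item in iterable:
                self.add(item)

    def discard(self, item):
        super().discard(self._normalize(item))

    def __contains__(self, item):
        return super().__contains__(self._normalize(item))


ci = CaseInsensitiveSet(['#Yax', '#YAX', 7])
print(len(ci), '#yAx' in ci, 7 in ci)  # prints "2 True True"
```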
kindall
1

Even though every answer uses .lower(), your desired output keeps the capitalization.

To achieve that, you can do this:

l = ['#Trending', '#Trending', '#TrendinG', '#Yax', '#YAX', '#Yax']
l = set(i[0]+i[1:].capitalize() for i in l)
print l

Output:

set(['#Trending', '#Yax'])
f.rodrigues
0

Another option is with the istr (case-insensitive str) object from the multidict library:

In [1]: from multidict import istr, CIMultiDict

In [2]: s = {'user-agent'}

In [5]: s = CIMultiDict({istr(k): None for k in {'user-agent'}})

In [6]: s
Out[6]: <CIMultiDict('User-Agent': None)>

In [7]: 'user-agent' in s
Out[7]: True

In [8]: 'USER-AGENT' in s
Out[8]: True
Brad Solomon