Case-insensitive set intersection

Question

What would be the best way to do the following case-insensitive intersection:

a1 = ['Disney', 'Fox']
a2 = ['paramount', 'fox']
a1.intersection(a2)
> ['fox']

Normally I'd use a list comprehension to convert both lists to lowercase:

>>> set([_.lower() for _ in a1]).intersection(set([_.lower() for _ in a2]))
set(['fox'])

but it's a bit ugly. Is there a better way to do this?

Not really; about the best you can do is to convert everything to lower case in one pass, and keep it that way for the remainder of your processing. — Prune, Oct 12 '18 at 16:20
https://stackoverflow.com/questions/1479979/case-insensitive-comparison-of-sets-in-python — mad_, Oct 12 '18 at 16:23
Similar question: [Case-insensitive comparison of sets in Python](https://stackoverflow.com/q/1479979/674039). Not a great dupe because that one is about `frozenset`, and the set literal syntax is not available. — wim, Oct 12 '18 at 16:30

wim · Accepted Answer · 2018-10-12T16:37:30.520

8

Using the set comprehension syntax is slightly less ugly:

>>> {str.casefold(x) for x in a1} & {str.casefold(x) for x in a2}
{'fox'}

The algorithm is the same, and no more efficient approach is available, because the hash values of strings are case-sensitive.

Using str.casefold instead of str.lower behaves more correctly for international data; it is available since Python 3.3.
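To illustrate the difference, here is a small sketch. The German street names are made-up sample data, not part of the question; they rely on the fact that 'ß' is left unchanged by str.lower but is folded to 'ss' by str.casefold.

```python
# Hypothetical sample data: 'ß' is already lowercase, so lower() keeps it,
# while casefold() maps it to 'ss'.
a1 = ['Straße', 'Fox']
a2 = ['STRASSE', 'fox']

with_lower = {s.lower() for s in a1} & {s.lower() for s in a2}
with_casefold = {s.casefold() for s in a1} & {s.casefold() for s in a2}

print(sorted(with_lower))     # ['fox']
print(sorted(with_casefold))  # ['fox', 'strasse']
```

With str.lower the two spellings of the street name never match; with str.casefold they do.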

edited Oct 12 '18 at 16:37

answered Oct 12 '18 at 16:23

wim


score 1 · Answer 2 · answered Oct 12 '18 at 16:31

There are some problems with definitions here, for example in the case that a string appears twice in the same set with two different cases, or in two different sets (which one do we keep?).
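One way to make the choice explicit is to decide up front whose spelling wins. The sketch below assumes a policy of "keep the first spelling seen in the first list"; the function name intersect_keep_first is hypothetical and the policy is only one of several reasonable ones.

```python
def intersect_keep_first(a1, a2):
    """Case-insensitive intersection that returns spellings taken from a1."""
    folded_a2 = {s.casefold() for s in a2}
    first_spelling = {}
    for s in a1:
        # setdefault keeps the first spelling seen for each folded key
        first_spelling.setdefault(s.casefold(), s)
    return {orig for key, orig in first_spelling.items() if key in folded_a2}


# 'Fox' and 'FOX' collide after folding; the first one in a1 is kept.
print(intersect_keep_first(['Disney', 'Fox', 'FOX'], ['paramount', 'fox']))  # {'Fox'}
```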

That said, if you don't care about those edge cases and you want to perform this sort of intersection many times, you can create a case-insensitive string wrapper:

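class StrIgnoreCase:
    """Wrap a string so that equality and hashing ignore case."""

    def __init__(self, val):
        self.val = val  # the original spelling is preserved

    def __eq__(self, other):
        if not isinstance(other, StrIgnoreCase):
            return False
        return self.val.lower() == other.val.lower()

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self.val.lower())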

Then I'd maintain both sets so that they contain these objects instead of plain strings. That requires fewer conversions each time a new set is created or an intersection is computed.
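A minimal usage sketch, with the class repeated so the snippet runs on its own. The variable names s1, s2 and common are illustrative only.

```python
class StrIgnoreCase:
    def __init__(self, val):
        self.val = val

    def __eq__(self, other):
        if not isinstance(other, StrIgnoreCase):
            return False
        return self.val.lower() == other.val.lower()

    def __hash__(self):
        return hash(self.val.lower())


# Wrap once, when the sets are built.
s1 = {StrIgnoreCase(x) for x in ['Disney', 'Fox']}
s2 = {StrIgnoreCase(x) for x in ['paramount', 'fox']}

# Ordinary set operators now compare case-insensitively.
common = s1 & s2
print(len(common))                      # 1
print({x.val.lower() for x in common})  # {'fox'}
```

Whether the surviving element carries 'Fox' or 'fox' as its val depends on how the set implementation picks elements, so code should not rely on either spelling.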

answered Oct 12 '18 at 16:31

Barak Itkin


Why not store the *lowered* version on the instance instead of lowering it repeatedly? That trades space for CPU cycles. – Moses Koledoye Oct 12 '18 at 23:50
@MosesKoledoye I can imagine cases where you might want to keep the original string. If keeping all strings lowercase were an option, I guess the question wouldn't have been asked. – Barak Itkin Oct 13 '18 at 00:00
I was suggesting to keep both :P – Moses Koledoye Oct 13 '18 at 08:34
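The "keep both" idea from these comments could look roughly like the sketch below. The class name FoldedStr and its key attribute are hypothetical, and it uses str.casefold rather than the str.lower of the answer above. The folded key is computed once in the constructor, so hashing and comparison do no further string conversion.

```python
class FoldedStr:
    """Keeps the original string and a precomputed case-folded key."""

    __slots__ = ('val', 'key')

    def __init__(self, val):
        self.val = val             # original spelling, kept for display
        self.key = val.casefold()  # computed once, reused for eq and hash

    def __eq__(self, other):
        if not isinstance(other, FoldedStr):
            return NotImplemented
        return self.key == other.key

    def __hash__(self):
        return hash(self.key)

    def __repr__(self):
        return f'FoldedStr({self.val!r})'


s1 = {FoldedStr(x) for x in ['Disney', 'Fox']}
s2 = {FoldedStr(x) for x in ['paramount', 'fox']}
common = s1 & s2
print([x.key for x in common])  # ['fox']
```

The cost is one extra string per element; in exchange, repeated intersections only compare and hash the cached keys.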
