I want to use a custom compare function when building a set, so that I can take advantage of the efficiency of the set algorithm. Technically I could write a double for loop that compares the two lists (keep, original), but I thought that might not be efficient.

For example:

textlist = ["ravi is happy", "happy ravi is", "is happy ravi", "is ravi happy"]

set() should return only one of these elements, because the compare function would return True when the similarity between the two items being compared is >= threshold.

This is in Python. Thanks.

P.S.

The real trick is that I'd like to use my string_compare(t1, t2) function, which returns a float, to do the comparison, rather than hashing and equality.

P.P.S.

C# has a similar question: "How to remove similar string from a list?"

Rav B
  • That's not how sets work. They're implemented using a hash table, so you need to write a hash function that returns the same value for all similar strings. – Barmar Apr 15 '20 at 23:07
  • Also, the hash function is associated with the objects, not the set. So you'd need to create a subclass of `str`. – Barmar Apr 15 '20 at 23:28
  • If, for example, I convert each string to an int built from a unique ID for each character (assuming ASCII), will `__eq__` be called when checking whether two elements are equal? So far I have something like this: `class string_wrap(object): def __init__(self, t): self.t=t def __eq__(self, other): return string_compare(self.t, other.t) >= simthreshold def __hash__(self): return hash(self.t)` – Rav B Apr 15 '20 at 23:33
  • It uses `__hash__` to find the hash bucket, then searches the bucket to find an element that's `__eq__`. – Barmar Apr 16 '20 at 00:04
  • What is `t`? The hash function has to return a number. – Barmar Apr 16 '20 at 00:06
  • See https://www.asmeurer.com/blog/posts/what-happens-when-you-mess-with-hashing-in-python/ for a good explanation. – Barmar Apr 16 '20 at 00:07
  • `t` is a string in this case. I wrap each element in the list with `string_wrap`, then compute the set of the new list. – Rav B Apr 16 '20 at 00:12
  • Oh, I see you wrote `hash(self.t)`. But that won't return the same hash code for all the similar strings. – Barmar Apr 16 '20 at 00:12
  • You could try something like `return hash(''.join(sorted(self.t)))` so the hash code won't depend on the character order. – Barmar Apr 16 '20 at 00:13
  • But this really depends on your definition of similar. Are `happy` and `hippy` similar, since they only have 1 character difference? – Barmar Apr 16 '20 at 00:14
  • Similar to this post: https://stackoverflow.com/questions/16306844/custom-comparison-functions-for-built-in-types-in-python. The similarity function I wrote is called `string_compare`; it returns a float from 0 to 1. – Rav B Apr 16 '20 at 00:14
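
The mechanism described in the comments above (`__hash__` picks the bucket, then `__eq__` confirms the match) can be sketched with a small wrapper class. This is a minimal sketch, not the asker's code: the `WordBag` name and its word-sorted key are assumptions made for illustration, and it only treats two sentences as equal when they contain exactly the same words, not when a `string_compare` score passes a threshold.

```python
class WordBag:
    """Hypothetical wrapper: sentences are equal when they contain the same words."""

    def __init__(self, text):
        self.text = text
        # Canonical key: the words in sorted order, so word order is ignored.
        self.key = tuple(sorted(text.split()))

    def __hash__(self):
        # Objects that compare equal must hash equally, so hash the canonical key.
        return hash(self.key)

    def __eq__(self, other):
        if not isinstance(other, WordBag):
            return NotImplemented
        return self.key == other.key

    def __repr__(self):
        return f"WordBag({self.text!r})"


textlist = ["ravi is happy", "happy ravi is", "is happy ravi", "is ravi happy"]
unique = {WordBag(s) for s in textlist}
print(len(unique))
```

All four sentences share one key, so they land in the same bucket and compare equal, and the set keeps a single element. A threshold-based `__eq__` would only behave this predictably if every pair of "similar" strings also produced the same hash.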

1 Answer

I think this is what you were looking for:

{' '.join(sorted(sentence.split())) for sentence in textlist}

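This sorts the words in each sentence, so sentences that contain the same words in a different order become identical strings, and an ordinary Python set can then deduplicate them. Note that it only handles reordered words; it does not apply a similarity threshold.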
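
If a real similarity threshold is needed, a hash-based set generally cannot express it, because "similar enough" is not transitive (A may be close to B and B to C without A being close to C). One option is a greedy pairwise filter, sketched below under stated assumptions: `string_compare` here is a stand-in built on `difflib.SequenceMatcher`, not the asker's actual function, and `dedupe_similar` and the 0.9 threshold are hypothetical names and values chosen for illustration.

```python
from difflib import SequenceMatcher


def string_compare(t1, t2):
    # Stand-in for the asker's function (an assumption): a similarity score
    # in [0, 1] computed on the word-sorted form of each sentence.
    a = " ".join(sorted(t1.split()))
    b = " ".join(sorted(t2.split()))
    return SequenceMatcher(None, a, b).ratio()


def dedupe_similar(items, threshold=0.9, compare=string_compare):
    """Keep an item only if it is not similar to any item already kept."""
    kept = []
    for item in items:
        # Compare against the kept items only, not against the whole input.
        if all(compare(item, k) < threshold for k in kept):
            kept.append(item)
    return kept


textlist = [
    "ravi is happy",
    "happy ravi is",
    "is happy ravi",
    "is ravi happy",
    "the sky is blue",
]
print(dedupe_similar(textlist))
```

This is still the nested loop the question hoped to avoid, but each item is compared only against the items already kept, so the cost grows with the number of distinct groups rather than with the square of the input size. The result depends on input order, since the first member of each group is the one that survives.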

Dharman
Dev