Choosing strings that are most different from each other in Python

Question

This question might be a little unusual, so let me first give some background.

I am using spintax to generate large blocks of text from a set of optional phrases. I call the spin function inside a loop over `range(0, 10)`, so it creates multiple strings, each of them different.

for i in range(0, 10):
    # Generate one random variant of the template per iteration
    L.append(spintax.spin(
  " ----<h1>{" +Title+ " - {køb online|sammenlign {priser|online supermarkederne}} via x.dk|Få din "+y+ "\
  leveret til døren og spar penge via x.dk|Køb din "+y+ " online og spar penge  via x.dk }\
  \n  \
  ----<h2>{{Få adgang til|vælg fra} {et stort|Danmarks største} {udvalg} af} " +y+ "<h2>\
  \n  \
  {Når|Hvis} du {besøger|handler ind gennem|benytter|køber ind via|køber dine varer via}\
  x.dk, {er det {vigtigt|væsentligt} at forstå|skal du huske|skal du vide}"))

    # Record the ID that this generated text belongs to
    L2.append(df['ID'][index])
df2 = pd.DataFrame(np.column_stack([L, L2]), columns=['Text' ,'ID'])

This is an example of what my code looks like. `L` is a list that holds the generated text and `L2` is a list of IDs (I won't explain that list in detail, as it's off-topic). My `df2` DataFrame therefore looks like this:

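| Index | Text                                                           | Id   |
|-------|----------------------------------------------------------------|------|
| 0     | <h1>Få din Mælk & Fløde leveret til døren og spar penge via... | 4169 |
| 1     | <h1>Mælk & Fløde - køb online via x.dk                         | 4169 |
|       | ....                                                           |      |
| 12    | <h1>Få din Yoghurt leveret til døren og spar penge via         | 4178 |
|       | ....                                                           |      |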

At this point there are 10 text strings for every Id. I need to bring these down to 1, and this is where my problems start. I want to make sure that the chosen text strings all differ from one another to some extent. From the 10 strings per Id, I need to choose the 1 that differs most from the strings of the other Ids.

Hopefully that makes sense.

To summarize, in case you got lost along the way: is there a way to measure the similarity between strings? A way to compare text strings and choose the one that is the most different from all the others?

So it depends on your definition of "most different". Is A more different from Z than B is? Or do you want it to be based on which has the most words that aren't in other ones? That is the root of how to solve the problem as well. — MyNameIsCaleb, Oct 01 '19 at 13:28
You can use [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance) for this, but since you are dealing with a large set of data this becomes an optimization problem, which can be tricky. — 0x5453, Oct 01 '19 at 13:28
@MyNameIsCaleb The first option. The word similarity should be low, so it doesn't look like it's the same thing all over again. — Questieme, Oct 01 '19 at 13:31
@0x5453 I will look into that, thanks! The data is not that large, though - the table has around 500 rows. The content of the `Text` column, on the other hand, is pretty big. — Questieme, Oct 01 '19 at 13:32

Michael Gardner · Accepted Answer · 2019-10-01T18:54:46.597

In the data below, the Text in Index 0 & 2 and the Text in Index 4 & 5 are the most similar within each unique Id, since one contains text from the other. The least similar rows within each Id are therefore Index 1 & 3.

To find the least similar Text we can use TF-IDF to encode each Text as a numeric vector. We then compute the Euclidean distance between each pair of rows within each group, average the distances for each row, and treat the row with the largest mean distance as the least similar. Finally, we grab the row with the largest mean for each group of Ids.

Data:

| Index | Text                                                       | Id   |
|-------|------------------------------------------------------------|------|
| 0     | Få din Mælk & Fløde leveret til døren og spar penge via... | 4169 |
| 1     | Mælk & Fløde - køb online via x.dk                         | 4169 |
| 2     | Fløde leveret til døren og spar penge via...               | 4169 |
| 3     | Få din Mælk & Fløde leveret til døren og spar penge via... | 4170 |
| 4     | Mælk & Fløde - køb online via x.dk                         | 4170 |
| 5     | køb online via x.dk                                        | 4170 |

In:

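import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cdist

# Read the table above from the clipboard and tidy the column names
df = pd.read_clipboard()
df.columns = df.columns.str.strip()

# Encode every Text as a TF-IDF vector (one column per term)
v = TfidfVectorizer()
X = v.fit_transform(df['Text'])

# Append the TF-IDF columns after the three original columns (Index, Text, Id)
df = df.join(pd.DataFrame(X.toarray()))

group = df.groupby('Id', as_index=False)

# Within each Id: pairwise Euclidean distances between the TF-IDF rows,
# then keep the row whose mean distance to the rest of its group is largest
df = group.apply(lambda x : x.iloc[cdist(x.iloc[:,3:].values, x.iloc[:,3:].values).mean(axis=0).argmax()])

df[['Index', 'Text', 'Id']]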

Out:

|   | Index | Text                                                       | Id   |
|---|-------|------------------------------------------------------------|------|
| 0 | 1     | Mælk & Fløde - køb online via x.dk                         | 4169 |
| 1 | 3     | Få din Mælk & Fløde leveret til døren og spar penge via... | 4170 |
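
If you would rather not depend on the clipboard, here is a minimal self-contained sketch of the same idea. It builds the sample table inline and keeps the TF-IDF matrix separate from the DataFrame instead of joining it on. The helper name `least_similar_per_group` and its parameters are my own for illustration, not part of pandas or scikit-learn.

```python
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.feature_extraction.text import TfidfVectorizer


def least_similar_per_group(df, text_col="Text", group_col="Id"):
    """Hypothetical helper: per group, return the row whose text has the
    largest mean Euclidean TF-IDF distance to the texts in the same group."""
    # Fit TF-IDF on all texts so every group shares one vocabulary.
    X = TfidfVectorizer().fit_transform(df[text_col]).toarray()

    picked = []
    # .indices maps each group key to the positional row numbers in that group.
    for positions in df.groupby(group_col).indices.values():
        vectors = X[positions]
        # Pairwise distances within the group, averaged per row.
        mean_dist = cdist(vectors, vectors).mean(axis=1)
        picked.append(positions[mean_dist.argmax()])

    return df.iloc[sorted(picked)]


# Sample data copied from the table above.
df = pd.DataFrame({
    "Text": [
        "Få din Mælk & Fløde leveret til døren og spar penge via...",
        "Mælk & Fløde - køb online via x.dk",
        "Fløde leveret til døren og spar penge via...",
        "Få din Mælk & Fløde leveret til døren og spar penge via...",
        "Mælk & Fløde - køb online via x.dk",
        "køb online via x.dk",
    ],
    "Id": [4169, 4169, 4169, 4170, 4170, 4170],
})

result = least_similar_per_group(df)
print(result)
```

On the sample data this should pick rows 1 and 3, the same rows as the output above. Note that it only compares texts within the same Id, exactly like the original snippet, so it does not yet guarantee that the chosen strings differ across Ids.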

Woah, this solution is pretty awesome. Thank you very, very much! — Questieme, Oct 02 '19 at 08:42
