Alright, this question might be a little weird, so first let me give you some background.
I am using spintax to generate large blocks of text from a set of optional phrases. I call spin inside a loop over the range from 0 to 10, so it creates multiple strings, each one different from the others.
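import numpy as np
import pandas as pd
import spintax

# Title, y, df and index are assumed to be defined earlier in the script
L, L2 = [], []
for i in range(0, 10):
    # every call to spin() picks one random option from each {a|b|c} group
    L.append(spintax.spin(
        " ----<h1>{" + Title + " - {køb online|sammenlign {priser|online supermarkederne}} via x.dk"
        "|Få din " + y + " leveret til døren og spar penge via x.dk"
        "|Køb din " + y + " online og spar penge via x.dk }\n"
        " ----<h2>{{Få adgang til|vælg fra} {et stort|Danmarks største} {udvalg} af} " + y + "<h2>\n"
        " {Når|Hvis} du {besøger|handler ind gennem|benytter|køber ind via|køber dine varer via} "
        "x.dk, {er det {vigtigt|væsentligt} at forstå|skal du huske|skal du vide}"))
    # keep the ID of the current item next to each generated text
    L2.append(df['ID'][index])
# one row per generated text, paired with its ID
df2 = pd.DataFrame(np.column_stack([L, L2]), columns=['Text', 'ID'])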
Right, so this is an example of what my code looks like. L is a list that holds the generated text and L2 is a list of IDs (I'm not going to explain that list any further, as it's off-topic). My df2 DataFrame will therefore look like this:
Index  Text                                      Id
0      <h1>Få din Mælk & Fløde leveret til       4169
       døren og spar penge via...
1      <h1>Mælk & Fløde - køb online via x.dk    4169
....
12     <h1>Få din Yoghurt leveret til døren      4178
       og spar penge via
....
So at this point, there are 10 text strings for every Id. I need to bring these down to 1, and this is where my problem starts. I want to make sure that the final text strings all differ from one another to some extent. So from the 10 strings per Id, I need to choose the 1 that differs the most from the strings of the other Ids.
Hopefully, that kinda makes sense...
To summarize, in case you got lost along the way: is there any way to measure the similarity between text strings? And a way to compare them and choose the string that is the most different from all the others?
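To make the goal a bit more concrete, here is a rough sketch of the kind of selection I have in mind. It uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure (I don't know whether that is the right tool for this), and the names similarity, pick_most_distinct and candidates_by_id, the greedy strategy, and the example data are all made up for illustration:

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def pick_most_distinct(candidates_by_id):
    """Greedily pick one text per ID that is least similar to the texts picked so far."""
    chosen = {}
    for item_id, candidates in candidates_by_id.items():
        if not chosen:
            # nothing to compare against yet, so just take the first candidate
            chosen[item_id] = candidates[0]
            continue
        # score each candidate by its highest similarity to any already-chosen
        # text, then keep the candidate with the lowest such score
        chosen[item_id] = min(
            candidates,
            key=lambda text: max(similarity(text, other) for other in chosen.values()),
        )
    return chosen


# made-up example data: two candidates per ID instead of ten
candidates_by_id = {
    4169: ["Få din Mælk leveret til døren", "Mælk - køb online via x.dk"],
    4178: ["Få din Yoghurt leveret til døren", "Køb din Yoghurt online og spar penge"],
}
result = pick_most_distinct(candidates_by_id)
print(result)
```

With my real data, I assume the input dictionary could be built with something like df2.groupby('ID')['Text'].apply(list).to_dict(). The greedy approach depends on the order of the Ids and is probably not optimal, so better ideas are welcome.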