Sorting all rows and columns by most similar strings across dataframe

Question

I have a dataframe with many columns and rows, completely unsorted. I would like to sort rows in each column such that the each element in each row is next to the most similar one in all other columns. I'm aware of this solution, but I'd like to have this being compared among multiple lists / columns, like a sorting algorithm.

     col1     col2      col3      col4
0    Some     Black     Red       Sky
1    Blue     Green     Floor     Bucket
2    Blacky   Same      Green     Rad

Let's consider that the first letter (first 2? 3?) must be an exact hit. It would result in

     col1     col2      col3      col4
0    Some     Same                Sky       #Some and Same are very similar, Sky is the closest in col4                                 
1    Blue                                   #No match for Blue 
2    Blacky   Black               Bucket    #Black,Blacky are similar, and more similar to Bucket
3             Green     Green               #Exact match
4                       Red       Rad       #Similar Match
5                       Floor               #No word started with F in any other column

To solve this I thought of either:

For each column, building a unique list of all elements in all other columns and align the target column vs the consensus. But I'm not sure how this would go afterwards
For each element, find the closest 1 element for each of the other columns.
Simply sorting alphabetically.

All the above have problems, and in all situations I will need to do some manual inspection (which is fine).

> Start with two tables with as many rows as there in column 1 and the same number of columns.; > For each row: 1. Find number of columns - 1 matches from each of the remaining columns. Add best match based on some numerical measure (eg. edit distance) to corresponding column in the first data-frame and the value of the measure to the other; if first letter does not match fill nan in the first data frame and 0 in the second.; 2. (Assuming no duplicates) if a word is already seen in a column, keep the one with higher value, drop the other. (contd.) — yad, Jul 24 '20 at 07:56
> Once you're out of words in the first row, create a new row for the next available word, padded for missing columns.; > Repeat till all words are exhausted. This is an idea that *should* work in theory. Not the most efficient though. — yad, Jul 24 '20 at 08:00
I said keep the "higher value", but whether you want to keep the greater or the lesser valued word depends on the choice of measuring function. — yad, Jul 24 '20 at 08:07
@119631 what was your idea of how to concatenate with the search from additional columns? Starting with col1, I found the best matches in each of the remaining columns, but how to plug in the best matches based on col2? and col3?... — Sos, Jul 24 '20 at 10:19

score 0 · Answer 1 · answered Jul 24 '20 at 11:59

Assign an id to every word.
Precompute the Levenshtein distances between every word and every word from the other columns. Store the result in a square matrix, indexed by the ids of the words.
Define the cost of a row to be the sum of distances between all pairs of words in the row.
Now you need to assign a row to each word, so as to minimize the sum of costs of all rows, while respecting the constraint that words from the same column must be on different rows. This optimisation problem is a quadratic assignment problem that can easily be passed to a quadratic integer programming library.

In step 1, when attributing ids, ids must be unique, so consider all words to be different. It doesn't matter if two words are the same; give them different ids anyway.

In step 2, you'll need to put some default value in the cells that correspond to pairs of words from the same column; the value you choose doesn't matter.

In step 4, the constraint can be reworded as "every row is only present once in every column", which is a simple linear constraint.

Sorting all rows and columns by most similar strings across dataframe

1 Answers1