Suppose we have two dataframes.
original_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
1 | a | 10.42 |
2 | b | 1.27 |
3 | b | 6.32 |
4 | a | 5.91 |
jumbled_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
11 | b | 6.43 |
12 | b | 1.26 |
13 | a | 9.98 |
14 | a | 15.84 |
15 | a | 6.01 |
Then I want to perform a matching on this data so that I end up with a 1-1 correspondence between them. Where the matching maximises the size of the matching and minimises the difference in fuzzy_criteria. I.e the matching would be
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
1 | 10.42 | a | 9.98 | 13 | 0.44 |
2 | 1.27 | b | 1.26 | 12 | 0.01 |
3 | 6.32 | b | 6.43 | 11 | 0.11 |
4 | 5.91 | a | 6.01 | 15 | 0.1 |
EDIT:
To highlight the need for a maximal matching consider the following example:
original_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
1 | a | 1 |
2 | a | 2 |
jumbled_data
sequence_number | fixed_criteria | fuzzy_criteria |
---|---|---|
13 | a | 1.9 |
14 | a | 2.9 |
Then a matching would provide (sorted by minimal diff):
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
1 | 1 | a | 1.9 | 13 | 0.9 |
2 | 2 | a | 2.9 | 14 | 0.9 |
1 | 1 | a | 2.9 | 14 | 1.9 |
then removing duplicates in sequence_number_original would provide the following
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
1 | 1 | a | 1.9 | 13 | 0.9 |
then in sequence_number_jumbled
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
Equally the other way round would do the same. First sequence_number_jumbled ...
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
2 | 2 | a | 2.9 | 14 | 0.9 |
Then sequence_number_original...
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
2 | 2 | a | 1.9 | 13 | 0.1 |
However this is not maximal as there is the following:
sequence_number_original | fuzzy_criteria_original | fixed_criteria | fuzzy_criteria_jumbled | sequence_number_jumbled | fuzz_diff |
---|---|---|---|---|---|
1 | 1 | a | 1.9 | 13 | 0.9 |
2 | 2 | a | 2.9 | 14 | 0.9 |
There are maximal matching algorithms in graph theory. I did actually just see this other post that is similar to mine.