
Suppose we have two dataframes.

original_data

sequence_number  fixed_criteria  fuzzy_criteria
1                a               10.42
2                b               1.27
3                b               6.32
4                a               5.91

jumbled_data

sequence_number  fixed_criteria  fuzzy_criteria
11               b               6.43
12               b               1.26
13               a               9.98
14               a               15.84
15               a               6.01

I then want to perform a matching on this data so that I end up with a 1-1 correspondence between the two dataframes, where rows may only be paired if they share the same fixed_criteria, the matching is of maximal size, and the total difference in fuzzy_criteria is minimised. I.e. the matching would be:

sequence_number_original  fuzzy_criteria_original  fixed_criteria  fuzzy_criteria_jumbled  sequence_number_jumbled  fuzz_diff
1                         10.42                    a               9.98                    13                       0.44
2                         1.27                     b               1.26                    12                       0.01
3                         6.32                     b               6.43                    11                       0.11
4                         5.91                     a               6.01                    15                       0.1

EDIT:

To highlight the need for a maximal matching, consider the following example:

original_data

sequence_number  fixed_criteria  fuzzy_criteria
1                a               1
2                a               2

jumbled_data

sequence_number  fixed_criteria  fuzzy_criteria
13               a               1.9
14               a               2.9

Then joining on fixed_criteria gives the following candidate pairs (sorted by ascending fuzz_diff):

sequence_number_original  fuzzy_criteria_original  fixed_criteria  fuzzy_criteria_jumbled  sequence_number_jumbled  fuzz_diff
2                         2                        a               1.9                     13                       0.1
1                         1                        a               1.9                     13                       0.9
2                         2                        a               2.9                     14                       0.9
1                         1                        a               2.9                     14                       1.9

Then removing duplicates in sequence_number_original (keeping the smallest fuzz_diff) would provide the following:

sequence_number_original  fuzzy_criteria_original  fixed_criteria  fuzzy_criteria_jumbled  sequence_number_jumbled  fuzz_diff
2                         2                        a               1.9                     13                       0.1
1                         1                        a               1.9                     13                       0.9

Then removing duplicates in sequence_number_jumbled:

sequence_number_original  fuzzy_criteria_original  fixed_criteria  fuzzy_criteria_jumbled  sequence_number_jumbled  fuzz_diff
2                         2                        a               1.9                     13                       0.1

Equally, doing it the other way round gives the same result. First deduplicating sequence_number_jumbled:

sequence_number_original  fuzzy_criteria_original  fixed_criteria  fuzzy_criteria_jumbled  sequence_number_jumbled  fuzz_diff
2                         2                        a               1.9                     13                       0.1
2                         2                        a               2.9                     14                       0.9

Then sequence_number_original...

sequence_number_original  fuzzy_criteria_original  fixed_criteria  fuzzy_criteria_jumbled  sequence_number_jumbled  fuzz_diff
2                         2                        a               1.9                     13                       0.1

However, this is not maximal, as the following size-2 matching exists:

sequence_number_original  fuzzy_criteria_original  fixed_criteria  fuzzy_criteria_jumbled  sequence_number_jumbled  fuzz_diff
1                         1                        a               1.9                     13                       0.9
2                         2                        a               2.9                     14                       0.9

There are maximal matching algorithms in graph theory. I did actually just see this other post that is similar to mine.
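For instance (a sketch of one such approach, not from the original post): grouping by fixed_criteria and solving each group as a rectangular assignment problem with scipy's `linear_sum_assignment` always matches as many rows as possible per group while minimising the total cost, shown here on the two-row example above:

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

original_data = pd.DataFrame({'sequence_number': [1, 2],
                              'fixed_criteria': ['a', 'a'],
                              'fuzzy_criteria': [1.0, 2.0]})
jumbled_data = pd.DataFrame({'sequence_number': [13, 14],
                             'fixed_criteria': ['a', 'a'],
                             'fuzzy_criteria': [1.9, 2.9]})

rows = []
for crit, orig in original_data.groupby('fixed_criteria'):
    jumb = jumbled_data[jumbled_data['fixed_criteria'] == crit]
    if jumb.empty:
        continue
    # Cost matrix of absolute fuzzy differences for this group
    cost = np.abs(orig['fuzzy_criteria'].to_numpy()[:, None]
                  - jumb['fuzzy_criteria'].to_numpy()[None, :])
    # Matches min(len(orig), len(jumb)) pairs, minimising total cost
    r, c = linear_sum_assignment(cost)
    for i, j in zip(r, c):
        rows.append({'sequence_number_original': orig['sequence_number'].iloc[i],
                     'sequence_number_jumbled': jumb['sequence_number'].iloc[j],
                     'fuzz_diff': cost[i, j]})

result = pd.DataFrame(rows)
print(result)  # pairs 1-13 and 2-14: maximal size, total diff 1.8
```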

  • You can consider this as "join 2 dataframes with `fixed_criteria` and drop duplicates of `sequence_number_original` by keeping only the minimum `fuzz_diff`". – Emma Apr 11 '22 at 15:44
  • Thanks @Emma, you are right. The trouble is with the second part of your statement. How would you do this? To me it isn't trivial: you could do this for each sequence_number_original, then sequence_number_jumbled; however, you may get rid of connections that were valid but just not as good as the minimal match on the left or the right. – AlexWendland Apr 11 '22 at 16:12
  • Do you need to drop by just `sequence_number_original`? (`sort_values(['sequence_number_original', 'fuzz_diff']).drop_duplicates(['sequence_number_original'])`) In your example, I see unique `sequence_number_original` is preserved but `sequence_number_jumbled` is whatever matches the criteria. – Emma Apr 11 '22 at 16:16
  • Thanks @Emma I just edited the question to be a little clearer on this point. Does that make sense? – AlexWendland Apr 12 '22 at 08:52

2 Answers


If there are no duplicated values in either fuzzy_criteria column, you can create an auxiliary dataframe to determine the nearest value between the two fuzzy_criteria columns.

import pandas as pd
from itertools import product

# Sort all cross pairs by absolute difference, then greedily keep the first
# (smallest-difference) occurrence of each value from either side
df = pd.DataFrame(sorted(product(original_data['fuzzy_criteria'], jumbled_data['fuzzy_criteria']), key=lambda t: abs(t[0]-t[1])))
df = df.drop_duplicates(0, keep='first')
df = df.drop_duplicates(1, keep='first')
print(df)

       0     1
0   1.27  1.26
1   5.91  6.01
2   6.32  6.43
4  10.42  9.98

Then merge each of the two dataframes with this auxiliary dataframe separately, and finally merge the two results on the auxiliary dataframe columns.

df_ = pd.merge(
    (pd.merge(original_data, df, left_on='fuzzy_criteria', right_on=0)),
    (pd.merge(df, jumbled_data, left_on=1, right_on='fuzzy_criteria')),
    on=[0,1],
    suffixes=('_original', '_jumbled')
).drop([0, 1], axis=1)
df_['fuzz_diff'] = (df_['fuzzy_criteria_original'] - df_['fuzzy_criteria_jumbled']).abs()
print(df_)

   sequence_number_original fixed_criteria_original  fuzzy_criteria_original  \
0                         1                       a                    10.42
1                         2                       b                     1.27
2                         3                       b                     6.32
3                         4                       a                     5.91

   sequence_number_jumbled fixed_criteria_jumbled  fuzzy_criteria_jumbled  \
0                       13                      a                    9.98
1                       12                      b                    1.26
2                       11                      b                    6.43
3                       15                      a                    6.01

   fuzz_diff
0       0.44
1       0.01
2       0.11
3       0.10
Ynjxsjmh
  • Hi @Ynjxsjmh thanks for the answer. The issue is that keeping only the duplicates with minimal weighting might reduce the size of the pairing. Please see my edited question for a clarification. – AlexWendland Apr 12 '22 at 08:54

This is largely copied from @SpghttCd's answer to How to get the most pairs out of my pandas dataframe?

The idea is to use networkx to perform a maximal matching.

import pandas as pd
import networkx as nx

# Data input

original_data = pd.DataFrame({
    'sequence_number' : [1,2,3,4],
    'fixed_criteria' : ['a','b','b','a'],
    'fuzzy_criteria' : [10.42, 1.27, 6.32, 5.91]
})

jumbled_data = pd.DataFrame({
    'sequence_number' : [11,12,13,14,15],
    'fixed_criteria' : ['b','b','a','a','a'],
    'fuzzy_criteria' : [6.43, 1.26, 9.98, 15.84, 6.01]
})

# Merge along fixed criteria

joined_data = pd.merge(
    original_data,
    jumbled_data,
    how = 'inner',
    on = ['fixed_criteria'],
    suffixes=['_original','_jumbled']
)

# To use max weight matching, take the reciprocal of the difference (if the
# fuzzy values are not unique this will have to be changed, since equal
# values would divide by zero)

joined_data['weight'] = (1/abs(
    joined_data['fuzzy_criteria_original'] -
    joined_data['fuzzy_criteria_jumbled']
))

# Form graph

matching_graph = nx.from_pandas_edgelist(
    joined_data,
    source = 'sequence_number_original',
    target = 'sequence_number_jumbled',
    edge_attr = 'weight'
)

# Find matching

matching = nx.max_weight_matching(
    matching_graph,
    maxcardinality = True,  # make the matching as large as possible first
    weight = 'weight'
)

# Convert results back into dataframe and format. max_weight_matching
# returns each pair in arbitrary order, so orient each tuple so that the
# original sequence number comes first.

original_ids = set(original_data['sequence_number'])
results = pd.DataFrame(
    [(a, b) if a in original_ids else (b, a) for a, b in matching],
    columns=['sequence_number_original', 'sequence_number_jumbled']
)

results = pd.merge(
    results,
    joined_data,
    how = 'inner',
    on = ['sequence_number_original', 'sequence_number_jumbled'],
)

results['fuzzy_difference'] = abs(
    results['fuzzy_criteria_original'] -
    results['fuzzy_criteria_jumbled']
)

print(results)