My problem is similar to the ones in the following references, with minor differences, and I want the same kind of result:

I have one dataframe and want to get the partial ratio and token ratio between two of its columns. Column 1 holds a single word per row, while column 2 holds a list of words whose length varies from row to row (I changed it to a tuple to make the functions in the references work).

The main issue is that compare takes each element of column 1 and pairs it with every element of column 2, which creates a massive dataframe, when I only want each row's word compared with that same row's choices (1 to 1). How can I fix this?

import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5, 6],
        "fruits": ["apple", "apples", "orange", "apple tree", "oranges", "mango"],
        "choices": [
            ("app", "apull", "apple"),
            ("app", "apull", "apple", "appple"),
            ("orange", "org"),
            # Note: ("apple") is a plain string, not a one-element tuple.
            ("apple"),
            ("oranges", "orang"),
            ("mango"),
        ],
    }
)
    
   id      fruits      choices
0   1       apple      ('app', 'apull', 'apple')
1   2      apples      ('app', 'apull', 'apple', 'appple')
2   3      orange      ('orange', 'org')
3   4  apple tree      ('apple')
4   5     oranges      ('oranges', 'orang')
5   6       mango      ('mango')

What compare gives me in the variable explorer:

compare = pd.MultiIndex.from_product([df['fruits'], df['choices']]).to_series()

             fruits      choices
    0         apple      ('app', 'apull', 'apple')
    1         apple      ('app', 'apull', 'apple', 'appple')
    2         apple      ('orange', 'org')
    3         apple      ('apple')
    4         apple      ('oranges', 'orang')
    5         apple      ('mango')
    6         apples     ('app', 'apull', 'apple')
    7         apples     ('app', 'apull', 'apple', 'appple')
    8         apples     ('orange', 'org')
    ...
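To make the size difference concrete, here is a minimal sketch (assuming the same `fruits` and `choices` columns as above; the names `n_product` and `pairs` are only illustrative). A product of two 6-element columns always has 6 x 6 entries, whereas exploding `choices` keeps each fruit paired only with the choices from its own row:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "fruits": ["apple", "apples", "orange", "apple tree", "oranges", "mango"],
        "choices": [
            ("app", "apull", "apple"),
            ("app", "apull", "apple", "appple"),
            ("orange", "org"),
            "apple",
            ("oranges", "orang"),
            "mango",
        ],
    }
)

# Size of the cartesian product: every fruit paired with every row of choices.
n_product = len(df["fruits"]) * len(df["choices"])

# Row-wise pairing: explode turns each tuple element into its own row and
# repeats the fruit next to it; plain strings are left as a single row.
pairs = df.explode("choices").reset_index(drop=True)

print(n_product, len(pairs))
```

The exploded frame has one row per (fruit, choice) pair that actually belongs together, which is the 1-to-1 comparison being asked for.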

Is it possible to get output like the first output in reference 1, but with the choices as the second level of the multi-index?

Laurent

1 Answer

Here is one way to do it with Pandas explode and groupby:

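from fuzzywuzzy import fuzz

new_df = (
    # One row per (fruit, choice) pair; a plain string such as "mango" stays a single row.
    df.explode("choices")
    .drop(columns="id")
    .pipe(
        lambda df_: df_.assign(
            ratio=df_.apply(lambda x: fuzz.ratio(x["fruits"], x["choices"]), axis=1),
            token=df_.apply(
                lambda x: fuzz.token_sort_ratio(x["fruits"], x["choices"]), axis=1
            ),
        )
    )
    # Grouping on both columns moves them into a two-level MultiIndex.
    .groupby(["fruits", "choices"])
    # Each pair occurs only once, so first() keeps its single score
    # (used instead of agg(list).applymap(...); applymap is deprecated in recent pandas).
    .first()
)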

Then:

print(new_df)
# Output
                    ratio  token
fruits     choices
apple      app         75     75
           apple      100    100
           apull       60     60
apple tree apple       67     67
apples     app         67     67
           apple       91     91
           appple      83     83
           apull       55     55
mango      mango      100    100
orange     orange     100    100
           org         67     67
oranges    orang       83     83
           oranges    100    100
Laurent