I have a pandas dataframe with a format exactly like the one in this question and I'm trying to achieve the same result. In my case, I am calculating the fuzz ratio between each row's index label and its corresponding column name.

If I try this code (based on the answer to the linked question)

def get_similarities(x):
    return x.index + x.name

test_df = test_df.apply(get_similarities)

the concatenation of the row index and col name happens cell-wise, just as intended. Running type(test_df) returns pandas.core.frame.DataFrame, as expected.

However, if I adapt the code to my scenario like so

def get_similarities(x):
    return fuzz.partial_ratio(x.index, x.name)

test_df = test_df.apply(get_similarities)

it doesn't work: instead of a DataFrame, I get back a Series (the function returns an int).

I don't understand why the two samples don't behave the same way, nor how to fix my code so that it returns a dataframe holding, for each cell, the fuzz ratio between that cell's row index label and its column name.

mcansado

3 Answers

What about the following approach?

Assuming that we have two lists of strings:

In [245]: set1
Out[245]: ['car', 'bike', 'sidewalk', 'eatery']

In [246]: set2
Out[246]: ['walking', 'caring', 'biking', 'eating']

Solution:

In [247]: from itertools import product

In [248]: res = np.array([fuzz.partial_ratio(*tup) for tup in product(set1, set2)])

In [249]: res = pd.DataFrame(res.reshape(len(set1), -1), index=set1, columns=set2)

In [250]: res
Out[250]:
          walking  caring  biking  eating
car            33     100       0      33
bike           25      25      75      25
sidewalk       73      20      22      36
eatery         17      33       0      50
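
The same table can also be built without the flat array and reshape step. The following is only a sketch: `similarity_table` and `score` are hypothetical names, and `score` is a standard-library stand-in (an assumption, not `fuzz.partial_ratio`), so the numbers will differ from the output above.

```python
from difflib import SequenceMatcher

import pandas as pd


def score(a, b):
    # Stand-in scorer (assumption): 0-100 similarity from difflib,
    # used here so the sketch runs without fuzzywuzzy installed.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def similarity_table(rows, cols, scorer=score):
    # One inner list per row label, one score per column label.
    data = [[scorer(r, c) for c in cols] for r in rows]
    return pd.DataFrame(data, index=rows, columns=cols)


set1 = ['car', 'bike', 'sidewalk', 'eatery']
set2 = ['walking', 'caring', 'biking', 'eating']

res = similarity_table(set1, set2)
print(res)
```

With fuzzywuzzy installed, passing `scorer=fuzz.partial_ratio` should give the partial-ratio table instead.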
MaxU - stand with Ukraine
  • For the sake of brain teasing, how can this be accomplished via `pd.DataFrame.apply`? – iDrwish Jun 18 '18 at 22:09
  • @iDrwish, it could be pretty easy if `fuzz.partial_ratio` would be able to work with vectors, instead of strings... I don't know how to overcome this limitation using `.apply()`... – MaxU - stand with Ukraine Jun 18 '18 at 22:11
It took some digging, but I figured it out. The problem is that DataFrame.apply works column-wise or row-wise, not cell by cell, so your get_similarities function actually receives an entire row or column of data at a time. By default it receives an entire column, so to solve your problem you need a get_similarities function that returns a list, calling fuzz.partial_ratio manually on each element, like this:

import pandas as pd
from fuzzywuzzy import fuzz

def get_similarities(x):
    # x is one whole column: x.index holds the row labels, x.name the column label
    l = []
    for rname in x.index:
        print("Getting ratio for %s and %s" % (rname, x.name))
        score = fuzz.partial_ratio(rname, x.name)
        print("Score %s" % score)
        l.append(score)

    print(len(l))
    print()
    return l



a = pd.DataFrame([[1,2],[3,4]],index=['apple','banana'], columns=['aple','banada'])
c = a.apply(get_similarities,axis=0)

print(c)
print(type(c))

I left my print statements in there so you can see for yourself what the DataFrame.apply call is doing; that's when it clicked for me.
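
To make the difference between the two snippets in the question concrete, here is a minimal sketch. The names `score`, `scalar_per_column` and `series_per_column` are hypothetical, and `score` is a standard-library stand-in for `fuzz.partial_ratio` (an assumption, so the sketch runs without fuzzywuzzy). It shows that one scalar per column collapses the result to a Series, while one labelled value per row gives back a DataFrame.

```python
from difflib import SequenceMatcher

import pandas as pd


def score(a, b):
    # Stand-in scorer (assumption), not fuzz.partial_ratio.
    return round(100 * SequenceMatcher(None, a, b).ratio())


df = pd.DataFrame(0, index=['car', 'bike'], columns=['caring', 'biking', 'eating'])


def scalar_per_column(col):
    # A single number for the whole column: apply reduces to a Series.
    return score(col.index[0], col.name)


def series_per_column(col):
    # One number per row label, indexed like the column: apply builds a DataFrame.
    return pd.Series([score(r, col.name) for r in col.index], index=col.index)


collapsed = df.apply(scalar_per_column)
expanded = df.apply(series_per_column)

print(type(collapsed).__name__)
print(type(expanded).__name__)
```

Returning a Series indexed by `col.index` rather than a plain list is a design choice in this sketch: it states the row labels explicitly instead of relying on how apply relabels list results.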

killian95

There is a way to accomplish this via DataFrame.apply with some row manipulations.

Assuming test_df is as follows:

In [73]: test_df
Out[73]: 
                  walking          caring          biking          eating
car            carwalking       carcaring       carbiking       careating
bike          bikewalking      bikecaring      bikebiking      bikeeating
sidewalk  sidewalkwalking  sidewalkcaring  sidewalkbiking  sidewalkeating
eatery      eaterywalking    eaterycaring    eaterybiking    eateryeating

In [74]: def get_ratio(row):
    ...:     return row.index.to_series().apply(lambda x: fuzz.partial_ratio(x, row.name))
    ...: 

In [75]: test_df.apply(get_ratio)
Out[75]: 
          walking  caring  biking  eating
car            33     100       0      33
bike           25      25      75      25
sidewalk       73      20      22      36
eatery         17      33       0      50
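
Note that with the default axis, the function above receives each column, so `row.name` is a column label. A row-wise variant is also possible with `axis=1`. The following is a hedged sketch: `get_ratio_by_row` and `score` are hypothetical names, and `score` is a standard-library stand-in for `fuzz.partial_ratio` (an assumption), so the values differ from the output above.

```python
from difflib import SequenceMatcher

import pandas as pd


def score(a, b):
    # Stand-in scorer (assumption), not fuzz.partial_ratio.
    return round(100 * SequenceMatcher(None, a, b).ratio())


rows = ['car', 'bike', 'sidewalk', 'eatery']
cols = ['walking', 'caring', 'biking', 'eating']
df = pd.DataFrame(0, index=rows, columns=cols)


def get_ratio_by_row(row):
    # With axis=1, row.name is the row label and row.index holds the column labels.
    return pd.Series([score(row.name, c) for c in row.index], index=row.index)


def get_ratio_by_col(col):
    # With the default axis=0, col.name is the column label and col.index the row labels.
    return pd.Series([score(r, col.name) for r in col.index], index=col.index)


by_row = df.apply(get_ratio_by_row, axis=1)
by_col = df.apply(get_ratio_by_col)

# Both orientations should fill the same cells with the same scores.
print((by_row.values == by_col.values).all())
```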
iDrwish