I am using difflib's get_close_matches to return the N=3 best matches for each value in my input column. I want to store the output in a single column of the dataframe, such as:

input    output
"xyz"    "xyz"
"xyz"    "xzy"
"xyz"    "xxy"
"pqr"    "pqr" 
...

What should I return from a call to apply that will automatically expand/broadcast the input to N outputs? For example, this will return the output as a list:

data["output"] = data["input"].apply(lambda x: difflib.get_close_matches(x, possibilities))

In this form, unpacking the list in each row would require many iterative calls to concat. There must be a more straightforward method that I am missing.

There are similar questions, such as "Returning multiple values from pandas apply on a DataFrame"; however, they all expand the output into separate columns, whereas I need it in a single column.

Edit: As IanS correctly points out, possibilities in this case is

possibilities = ['xyz', 'xzy', 'xxy', 'pqr']
Greg Brown

1 Answer

With the following example:

possibilities = ['xyz', 'xzy', 'xxy', 'pqr']

First, wrap each list of matches in a pandas Series, so that apply returns a DataFrame with one column per match (three columns here):

output = data["input"].apply(
    lambda x: pd.Series(difflib.get_close_matches(x, possibilities))
)

Output:

     0    1    2
0  xyz  xzy  xxy
1  pqr  NaN  NaN
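
To try this first step in isolation, here is a minimal self-contained sketch. The question never shows `data`, so the two-row frame below is an assumption chosen to match the output above:

```python
import difflib

import pandas as pd

possibilities = ['xyz', 'xzy', 'xxy', 'pqr']

# Assumed sample input; the original question does not show `data`.
data = pd.DataFrame({"input": ["xyz", "pqr"]})

# Returning a Series from the lambda makes apply build a DataFrame.
# Rows with fewer than three matches are padded with NaN.
output = data["input"].apply(
    lambda x: pd.Series(difflib.get_close_matches(x, possibilities))
)
print(output)
```

Because 'pqr' has only one close match, its row is filled with NaN in columns 1 and 2, which is what makes the later dropna step necessary.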

Second, join the matches to the original data and unstack, which gets you almost where you want to be:

result = data.join(output).set_index('input').unstack()

Output:

   input
0  xyz      xyz
   pqr      pqr
1  xyz      xzy
   pqr      NaN
2  xyz      xxy
   pqr      NaN

Third, all that is left is some beautification, for instance:

result.rename('output').reset_index(level=1).sort_values('input').dropna()

Output:

  input output
0   pqr    pqr
0   xyz    xyz
1   xyz    xzy
2   xyz    xxy
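
As a possible alternative, not part of the steps above: pandas 0.25 and later provide DataFrame.explode, which turns each element of a list-valued column into its own row. The sketch below assumes such a pandas version and the same hypothetical two-row `data` frame; `long_form` is an illustrative name.

```python
import difflib

import pandas as pd

possibilities = ['xyz', 'xzy', 'xxy', 'pqr']

# Assumed sample input; the original question does not show `data`.
data = pd.DataFrame({"input": ["xyz", "pqr"]})

# Keep the matches as one list per row, as in the question.
matches = data["input"].apply(
    lambda x: difflib.get_close_matches(x, possibilities)
)

# explode gives every list element its own row and repeats the input value.
# An empty list becomes NaN, so drop those rows afterwards.
long_form = (
    data.assign(output=matches)
    .explode("output")
    .dropna(subset=["output"])
    .reset_index(drop=True)
)
print(long_form)
```

This keeps the rows in the original input order rather than sorting them by input.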
IanS
  • That's it, thanks. I was getting hung up on the fact that it can return fewer than N matches for some inputs, so I thought I had to avoid outputting the results as separate columns, but it is a lot easier to drop the NaNs afterwards! – Greg Brown Jan 03 '17 at 11:52