
I'm new to pandas DataFrames, and I want to apply a function that takes a pair of consecutive rows from the same column. It is similar to what diff() does, except that I want to compute a distance between text values. I defined a function that measures the distance and tried to use apply, but I don't know how to pick a pair of rows. Below is an example of what I tried and what I expected:

def my_measure_function(x,y):
   return some_distance_calculus(x,y)

>>> from pandas import DataFrame
>>> df = DataFrame({"text": ['hello','hella','hel'], "B": [3,4,4]})
>>> df['dist'] = df.apply(lambda x, y: my_measure_function(x, y), axis=0)

But it doesn't work. What I want to obtain is:

>>> df
    text  B  dist
0  hello  3     0
1  hella  4     1
2    hel  4     2

Thanks in advance for any help that you can provide me.

M. Moresi

2 Answers

You may wish to avoid pd.DataFrame.apply, as performance may suffer. Instead, you can use map with pd.Series.shift:

df['dist'] = list(map(my_measure_function, df['text'], df['text'].shift()))

Or via a list comprehension:

zipper = zip(df['text'], df['text'].shift())
df['dist'] = [my_measure_function(val1, val2) for val1, val2 in zipper]
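
One thing to watch for: shift() leaves a missing value (NaN) in the first row, so the distance function has to cope with that. Here is a minimal runnable sketch of the list-comprehension approach. The body of my_measure_function is only a hypothetical stand-in (position-wise mismatches plus the length difference), since the question does not say which distance is used, and returning 0 for a missing neighbour is an assumption made to match the expected output.

```python
import pandas as pd


def my_measure_function(x, y):
    """Hypothetical stand-in distance, not the asker's real function."""
    # shift() puts NaN in the first row; treat a missing neighbour as distance 0
    # (an assumption chosen to match the expected output in the question).
    if pd.isna(x) or pd.isna(y):
        return 0
    # Count positions where the characters differ, then add the length difference.
    mismatches = sum(a != b for a, b in zip(x, y))
    return mismatches + abs(len(x) - len(y))


df = pd.DataFrame({"text": ["hello", "hella", "hel"], "B": [3, 4, 4]})

# Pair each value with the value from the previous row.
pairs = zip(df["text"], df["text"].shift())
df["dist"] = [my_measure_function(cur, prev) for cur, prev in pairs]

print(df)
```

With this stand-in, the dist column comes out as 0, 1, 2 for the sample frame. Swap in your own distance and keep the NaN check (or drop the first row) as needed.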
jpp
  • Thanks for your advice, but I need to compare rows from the same column, in this case the "text" column. I need to calculate the distance between 'hella' vs 'hello' and 'hel' vs 'hella' and put the result in a new column. – M. Moresi Oct 08 '18 at 23:47
  • @M.Moresi, Ah, I see, I've added a couple of options there. – jpp Oct 08 '18 at 23:50
  • @Wen, Yep, I've recently had bad experience with `apply` though [here](https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c/52674448#52674448) which is why I'm avoiding! – jpp Oct 08 '18 at 23:58
  • @jpp since he is asking about the apply function, that is why I am using `apply` – BENY Oct 08 '18 at 23:58
  • I had to wait a minute to get my vote back, but yours definitely deserves +1 as the simplest solution :) – jpp Oct 09 '18 at 00:00

diff is equivalent to s - s.shift(), so you can do the same with your function: add a shifted copy of the column, then apply the function row by row (axis=1):

df['shifttext'] = df['text'].shift()
df['dist'] = df.apply(lambda x: my_measure_function(x['text'], x['shifttext']), axis=1)
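
As a runnable sketch of this apply-based variant: the distance function below is again a hypothetical stand-in, and filling the first shifted value with the row's own text (so the first distance is 0) is an assumption chosen to match the expected output, not something the question specifies.

```python
import pandas as pd


def my_measure_function(x, y):
    """Hypothetical stand-in distance: mismatched positions plus length difference."""
    return sum(a != b for a, b in zip(x, y)) + abs(len(x) - len(y))


df = pd.DataFrame({"text": ["hello", "hella", "hel"], "B": [3, 4, 4]})

# Previous row's text; the first row has no predecessor, so compare it with itself.
df["shifttext"] = df["text"].shift().fillna(df["text"])

# axis=1 passes each row to the lambda, so both columns can be looked up by name.
df["dist"] = df.apply(
    lambda row: my_measure_function(row["text"], row["shifttext"]), axis=1
)

# The helper column is no longer needed once the distances are computed.
df = df.drop(columns="shifttext")

print(df)
```

Without axis=1, apply works column by column and the lookups by column name fail, which is why the row-wise axis matters here.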
BENY