I'm relatively new to Python and functions. I'm trying to apply the following function to each row of a DataFrame and store the computed result for each row in a new column:

def manhattan_distance(x,y):

  return sum(abs(a-b) for a,b in zip(x,y))

For reference, this is the dataframe I'm testing on:

import pandas as pd

entries = [
{'age1':'2', 'age2':'2'},
{'age1':'12', 'age2': '12'},
{'age1':'5', 'age2': '50'}
]

df=pd.DataFrame(entries)

df['age1'] = df['age1'].astype(str).astype(int)
df['age2'] = df['age2'].astype(str).astype(int)

I've seen the answers to "How to iterate over rows in a DataFrame in Pandas?" and have got as far as this:

import itertools
for index, row in df.iterrows():

    df['distance']=df.apply(lambda row: manhattan_distance(row['age1'], row['age2']), axis=1)

Which returns the following:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-aa6a21cd1de9> in <module>()
      4 #    print (manhattan_distance(row['age1'],row['age2']))
      5 
----> 6     df['distance']=df.apply(lambda row:    manhattan_distance(row['age1'], row['age2']), axis=1)

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in   apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4852                         f, axis,
   4853                         reduce=reduce,
-> 4854                         ignore_failures=ignore_failures)
   4855             else:
   4856                 return self._apply_broadcast(f, axis)

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4948             try:
   4949                 for i, v in enumerate(series_gen):
-> 4950                     results[i] = func(v)
   4951                     keys.append(v.name)
   4952             except Exception as e:

<ipython-input-42-aa6a21cd1de9> in <lambda>(row)
      4 #    print (manhattan_distance(row['age1'],row['age2']))
      5 
----> 6     df['distance']=df.apply(lambda row:     manhattan_distance(row['age1'], row['age2']), axis=1)

<ipython-input-36-74da75398c4c> in manhattan_distance(x, y)
      1 def manhattan_distance(x,y):
      2 
----> 3   return sum(abs(a-b) for a,b in zip(x,y))
      4  #   return sum(abs(a-b) for a,b in map(lambda x: zip(a,b)))

TypeError: ('zip argument #1 must support iteration', 'occurred at index 0')

Based on other responses to the question I referred to above, I attempted to amend the zip statement in my function (the amended line is visible at the end of the traceback below) and ran the same loop again:

import itertools
for index, row in df.iterrows():

    df['distance']=df.apply(lambda row: manhattan_distance(row['age1'], row['age2']), axis=1)

The above returns this:

--------------------------------------------------------------------------
TypeError                                 Traceback (most recent call  last)
<ipython-input-44-aa6a21cd1de9> in <module>()
      4 #    print (manhattan_distance(row['age1'],row['age2']))
      5 
----> 6     df['distance']=df.apply(lambda row:   manhattan_distance(row['age1'], row['age2']), axis=1)

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   4852                         f, axis,
   4853                         reduce=reduce,
-> 4854                         ignore_failures=ignore_failures)
   4855             else:
   4856                 return self._apply_broadcast(f, axis)

/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
   4948             try:
   4949                 for i, v in enumerate(series_gen):
-> 4950                     results[i] = func(v)
   4951                     keys.append(v.name)
   4952             except Exception as e:

<ipython-input-44-aa6a21cd1de9> in <lambda>(row)
      4 #    print (manhattan_distance(row['age1'],row['age2']))
      5 
----> 6     df['distance']=df.apply(lambda row:  manhattan_distance(row['age1'], row['age2']), axis=1)

<ipython-input-43-5daf167baf5f> in manhattan_distance(x, y)
      2 
      3 #  return sum(abs(a-b) for a,b in zip(x,y))
----> 4    return sum(abs(a-b) for a,b in map(lambda x: zip(a,b)))

TypeError: ('map() must have at least two arguments.', 'occurred at index 0')

If this is the right approach to take, I'm unclear what my map() arguments need to be for the function to work.

  • What would be the desired output? How do you compare two characters? – Willem Van Onsem Dec 16 '17 at 19:48
  • Could you please provide a formula to calculate the manhattan distance of two given values for `age1` and `age2`. How is the manhattan distance of two values defined, since I only find definitions for at least four values... – albert Dec 16 '17 at 20:10
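
As background to the first error, here is a small sketch of what `manhattan_distance` expects. `zip` needs iterables, while `row['age1']` and `row['age2']` are single integers, which is probably why the call fails. The sample values and the one-element-list wrapping are illustrative assumptions, not something taken from the question.

```python
def manhattan_distance(x, y):
    # Sum of absolute coordinate differences between two sequences.
    return sum(abs(a - b) for a, b in zip(x, y))

# Works when both arguments are sequences of coordinates.
vector_result = manhattan_distance([1, 2], [4, 6])  # |1-4| + |2-6|

# Fails when the arguments are plain integers, as in the question.
try:
    manhattan_distance(5, 50)
    scalar_error = None
except TypeError as exc:
    scalar_error = exc  # zip() cannot iterate over an int

# Illustrative workaround: treat each age as a one-coordinate point.
wrapped_result = manhattan_distance([5], [50])

print(vector_result, wrapped_result, type(scalar_error).__name__)
```

With one coordinate per point, the distance reduces to the absolute difference of the two values.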

1 Answer

import numpy as np
import pandas as pd

entries = [
{'age1':'2', 'age2':'2'},
{'age1':'12', 'age2': '12'},
{'age1':'5', 'age2': '50'}
]

df = pd.DataFrame(entries)
df['age1'] = df['age1'].astype(str).astype(int)
df['age2'] = df['age2'].astype(str).astype(int)

def manhattan_distance(row):
    # https://en.wikipedia.org/wiki/Taxicab_geometry#Formal_definition
    return np.sum(abs(row['age1']-row['age2']))

df['distance'] = df.apply(manhattan_distance, axis=1)
print(df)
albert
  • But here it makes no sense to `np.sum(..)` since `abs(..)` returns only *one* element. – Willem Van Onsem Dec 16 '17 at 20:17
  • @WillemVanOnsem: You're right. However, I wanted to show an approach for the general definition of Manhattan Distance which itself needs to sum up the difference of `p_i - q_i` – albert Dec 16 '17 at 20:23
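
To illustrate the two points raised in these comments, here is a rough sketch, not the answerer's code. The first part computes the single-coordinate case without `np.sum` or `apply`. The second part sums absolute differences over several coordinates; the `points` frame and the column names `p1`, `p2`, `q1`, `q2` are hypothetical and chosen only for the example.

```python
import numpy as np
import pandas as pd

# Same data as in the question, already as integers.
df = pd.DataFrame({"age1": [2, 12, 5], "age2": [2, 12, 50]})

# One coordinate per point: the distance is just the absolute difference,
# computed column-wise without a Python-level loop.
df["distance"] = (df["age1"] - df["age2"]).abs()

# General case (hypothetical columns): each row holds two points
# p = (p1, p2) and q = (q1, q2).
points = pd.DataFrame({
    "p1": [0, 1], "p2": [0, 2],
    "q1": [3, 1], "q2": [4, 5],
})
p_cols = ["p1", "p2"]
q_cols = ["q1", "q2"]

# Subtract coordinate-wise, take absolute values, then sum across each row.
diff = points[p_cols].to_numpy() - points[q_cols].to_numpy()
points["distance"] = np.abs(diff).sum(axis=1)

print(df)
print(points)
```

Converting to NumPy arrays before subtracting avoids pandas aligning the two sub-frames on their differing column labels.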