I had always thought that .map and .replace were essentially the same, except that you would use .replace when you want to keep the original values for keys not found in the provided dictionary. However, I'm confused about why .replace throws a TypeError when passed a dictionary with tuples as keys, while .map works as expected with the same dictionary.

For example:

import pandas as pd
df = pd.DataFrame({'ID1': [1, 2, 3, 4, 5], 
                   'ID2': ['A', 'B', 'C', 'D', 'E']})
df['tup_col'] = pd.Series(list(zip(df.ID1, df.ID2)))

dct = {(1, 'A'): 'apple', (3, 'C'): 'banana', (5, 'X'): 'orange'}

df.tup_col.map(dct)
#0     apple
#1       NaN
#2    banana
#3       NaN
#4       NaN
#Name: tup_col, dtype: object

df.tup_col.replace(dct)
#TypeError: Cannot compare types 'ndarray(dtype=object)' and 'tuple'

So can I not use replace in the case of a dictionary with tuples as the keys?

jpp
ALollz
  • Looking at the docs, the functions are not related. `replace` seems to relate to `apply`, rather than `map`. The doc also says you can use nested dict, but trying on your example I get 'tuple has no method replace', so that doesn't work either. – MrE Jul 12 '18 at 15:44

1 Answer

No, this won't work

First, pandas takes the keys and values from your dictionary and calls replace again with these two iterables:

keys, values = zip(*items)
to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)
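
To make this first step concrete, here is a small plain-Python sketch of what the unpacking does with the dictionary from the question. It assumes `items` is simply the dictionary's (key, value) pairs; the variable names mirror the excerpt above, but this is an illustration, not the pandas source.

```python
# Dictionary from the question: tuple keys mapped to string values.
dct = {(1, 'A'): 'apple', (3, 'C'): 'banana', (5, 'X'): 'orange'}

# Assumption: `items` holds the dictionary's (key, value) pairs.
items = list(dct.items())

# zip(*items) transposes the pairs into one tuple of keys and one of values.
keys, values = zip(*items)

print(keys)    # ((1, 'A'), (3, 'C'), (5, 'X'))
print(values)  # ('apple', 'banana', 'orange')
```

Note that `keys` is now a sequence of tuples, so from here on each tuple is treated as one item in a list of things to replace.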

Next, since both the keys and the values are now list-like, the call is routed to replace_list:

elif is_list_like(to_replace):  # [NA, ''] -> [0, 'missing']
    if is_list_like(value):
        new_data = self._data.replace_list(src_list=to_replace, dest_list=value,
                                           inplace=inplace, regex=regex)

Next, replace_list tries to compare each key (here, a tuple) against the array of values in the series:

def comp(s):
    if isnull(s):
        return isnull(values)
    return _possibly_compare(values, getattr(s, 'asm8', s),
                             operator.eq)

masks = [comp(s) for i, s in enumerate(src_list)]

Finally, _possibly_compare raises an error when the comparison returns a scalar even though at least one of the operands is an array:

if is_scalar(result) and (is_a_array or is_b_array):
    raise TypeError("Cannot compare types %r and %r" % tuple(type_names))

There are bits, possibly important ones, that I've left out here, but hopefully you get the gist.

Conclusion

In my opinion, pd.Series.replace has serious problems. Unlike most of the Pandas API, it is often unpredictable, both in what it achieves and in how it performs. It's also clear that parts of it are written in pure Python and do not perform well.

The documentation sums up the ambiguity well:

This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.

pd.Series.map is efficient and doesn't suffer from the pure Python logic implemented in replace.

See Replace values in a pandas series via dictionary efficiently for another example.

Stick with map and don't look back to replace.
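
If what you wanted from replace was to keep the original value whenever a key is missing, one possible way to get that with map is to pass a function that falls back to the key itself. This is only a sketch: the series `s` and the lambda are my own illustration, and because the function is called once per element it is not the fast dictionary path described above.

```python
import pandas as pd

# Same tuple values as df.tup_col in the question, built directly for brevity.
s = pd.Series([(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'E')])

dct = {(1, 'A'): 'apple', (3, 'C'): 'banana', (5, 'X'): 'orange'}

# dct.get(key, key) returns the mapped value if the tuple is a key,
# otherwise it returns the original tuple unchanged.
result = s.map(lambda key: dct.get(key, key))

print(result.tolist())
```

Here (1, 'A') and (3, 'C') are replaced, while the other tuples pass through untouched, which is the behaviour the question expected from replace.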

jpp
  • Wow, thank you for the great explanation! Your other answer in the link with mapping and then `fillna` with the other column is pretty much exactly what I was looking for! – ALollz Jul 12 '18 at 15:55