0

Using vectorization, I want to remap a column according to a nested dictionary depending on the value of a second column. In iteration logic, I would go through the rows; depending on the value in the second column, I would choose an entry from the dictionary's first level (which is a dictionary again) according to which the values are mapped.

MWE

I have a data frame with two columns of interest: country and variable:

df= pd.DataFrame({"country": ["AA", "BB", "AA", "AA", "BB", "AA"],
                "variable": ["foo/tball", "palace", "bla", "car", "bla", "dog"]})

I further have a nested dictionary dct (values in variable are only unique in combination with the country, and I thus cannot unnest the dictionary). dct contains strings of entries in df['variable'] and substrings thereof:

dct = {"AA": {'foo': 'NEWFOO', # substring of 'foo/tball'
          'bla' : 'NEWBLA', 
          'cart': 'this value is not in the dataframe'}, # sic! -- not substring of any entry
       "BB": {'pal': 'NEWPAL', # substring of palace
          'bla': 'DIFFERENT_NEWBLA'},
       "CC": {"this": "'CC' dictionary can be ignored"}}

I now want to map the entry of df['variable'] according to dct[df['country']] following these rules:

  • if the string matches or if the dictionary key is a substring of the entry in df['variable'] (e.g. 'foo' in dct['AA']: replace the entry according to the dictionary
  • otherwise (i.e. if the cell entry is not in the dictionary (e.g. 'dog'), replace with some predefined value (here, _some_flag_value_)
  • ignore entries in the dictionary that do not match either on the 'country' column (e.g. dct["CC"]) or the 'variable' column (e.g. 'cart' in dct['AA']).

The desired output should look like this:

out = pd.DataFrame({"country": ["AA", "BB", "AA", "AA", "BB", "AA"],
                    "variable": ["NEWFOO", "NEWPAL", "NEWBLA", "_some_flag_value_", 
                                 "DIFFERENT_NEWBLA", "_some_flag_value_"]})

I have tried using df.mapapply() with several combinations of lambda to no avail - can someone point me in the right direction? Thanks a lot in advance.

Ivo
  • 3,890
  • 5
  • 22
  • 53

1 Answers1

1

You can use pd.Series.str.cat to join both the columns with custom separator then use pd.Series.map

def f(x):
    c, v = x.split('-')
    d = dct.get(c)
    for k,val in d.items():
        if k in v:
           return val
    else:
        return '_some_flag_value_'

df.assign(variable = df['country'].str.cat(df['variable'],sep='-').map(f))

  country           variable
0      AA             NEWFOO
1      BB             NEWPAL
2      AA             NEWBLA
3      AA  _some_flag_value_
4      BB   DIFFERENT_NEWBLA
5      AA  _some_flag_value_

Or using df.apply1

def f(x):
    c, v = x
    d = dct.get(c)
    for k,val in d.items():
        if k in v:
           return val
    else:
        return '_some_flag_value_'

df.assign(variable = df.apply(f,axis=1))

Timeit results:

Benchmarked using the dataframe given in the question.

In [84]: %timeit df.assign(variable = df['country'].str.cat(df['variable'],sep='-').map(f))
    ...:
694 µs ± 22.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [82]: %timeit df.assign(variable = df.apply(f,axis=1))
    ...:
915 µs ± 40.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

1 df.apply over axis is generally avoided as it's slow and for-loops under the hood. Take a look at this answer by cs95

Ch3steR
  • 20,090
  • 4
  • 28
  • 58
  • thanks, this looks promising; what king of object is the input `x` to `f` if using apply? This is unpacked but I don't understand what we're passing in there. The `pd.Series` of `df`?! – Ivo Jul 15 '20 at 13:27
  • `pd.Series.map` applies the function `f` on each value of the series. So, `df['country'].str.cat(df['variable'],sep='-').map(f)` applies function on each value which is a string. @Ivo – Ch3steR Jul 15 '20 at 13:31
  • `df.apply` over axis 1 applies function `f` on each of the row values i.e function `f` is applied on `('AA', 'foo/tball')` and similarly for each row. – Ch3steR Jul 15 '20 at 13:33