3

Is there a Pandas solution (e.g. with numba or Cython) to transform/apply that also passes along the index?

I know I could use iterrows, itertuples, iteritems or items, but what I want to do should be trivial to vectorize… I've built a simple proxy for my actual use case (runnable code):

import numpy as np
import pandas as pd
from six import string_types  # plain `str` also works on Python 3

df = pd.DataFrame(
    np.random.randn(8, 4),
    index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])])

namednumber2numbername = {
    'one': ('zero', 'one', 'two', 'three', 'four',
            'five', 'six', 'seven', 'eight', 'nine'),
    'two': ('i',    'ii',  'iii', 'iv',    'v',
            'vi',   'vii', 'viii',  'ix',    'x')
}

def namednumber2numbername_applicator(series):        
    def to_s(value):
        if pd.isnull(value) or isinstance(value, string_types): return value
        value = np.ushort(value)
        if value > 10: return value

        # TODO: Figure out idx of `series.name` at this `value`… instead of `'one'`

        return namednumber2numbername['one'][value]

    return series.apply(to_s)

df.transform(namednumber2numbername_applicator)

Actual output

             0      1      2      3
bar one   zero   zero    one  65535
    two   zero   zero   zero   zero
baz one   zero   zero   zero   zero
    two   zero    two   zero   zero
foo one  65535   zero   zero   zero
    two   zero  65535  65534   zero
qux one   zero    one   zero   zero
    two   zero   zero   zero   zero

Output I want

             0      1      2     3
bar one   zero   zero    one  65535
    two      i      i      i      i
baz one   zero   zero   zero   zero
    two      i    iii      i      i
foo one  65535   zero   zero   zero
    two      i  65535  65534      i
qux one   zero    one   zero   zero
    two      i      i      i      i

Possibly related: How to query MultiIndex index columns values in pandas

Essentially I'm looking for the same behaviour as JavaScript's Array.prototype.map (which passes along the idx).
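
For reference, here's roughly what that map-with-index behaviour looks like when spelled out against a pandas Series. The helper name map_with_index is just illustrative, and it still falls back to Series.items(), i.e. the explicit iteration I'd like to avoid:

# Rough pandas analogue of JS `arr.map((value, idx) => f(value, idx))`.
# Illustrative only: it iterates explicitly via Series.items().
def map_with_index(series, func):
    return pd.Series([func(value, idx) for idx, value in series.items()],
                     index=series.index)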

A T

5 Answers

3

I wrote a very fast version of the transform to get these results. You can do the np.ushort conversion inside the list comprehension as well, and it's still fast, but it's much faster outside:

import time
df = pd.DataFrame(
    np.random.randn(8, 4**7),
    index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])])

start = time.time()
df.loc[:,] = np.ushort(df)
df = df.transform(lambda x: [i if i > 10 else namednumber2numbername[x.name[1]][i] for i in x], axis=1)
end = time.time()
print(end - start)

# 1.150895118713379

Here are the timings on the original:

df = pd.DataFrame( np.random.randn(8, 4),
     index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']), 
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]) 

start = time.time() 
df.loc[:,] = np.ushort(df) 
df = df.transform(lambda x: [i if i > 10 else namednumber2numbername[x.name[1]][i] for i in x], axis=1)
end = time.time() 
print(end - start)                                                                                                                                                                   
# 0.005067110061645508

In [453]: df                                                                                                                                                                                   
Out[453]: 
             0     1      2     3
bar one   zero  zero    one  zero
    two      i     i      i     i
baz one   zero  zero   zero  zero
    two      i     i     ii     i
foo one  65535  zero  65535  zero
    two      i     i      i     i
qux one   zero  zero   zero  zero
    two      i     i      i    ii

I got it down to a one-liner:

df.transform(lambda x: [np.ushort(value) if np.ushort(value) > 10 else namednumber2numbername[pos[1]][np.ushort(value)] for pos, value in x.items()])

             0     1      2     3
bar one   zero  zero   zero  zero
    two      i     i     ii     i
baz one  65534  zero  65535  zero
    two     ii     i  65535     i
foo one   zero  zero   zero  zero
    two     ii     i      i    ii
qux one  65535  zero   zero  zero
    two      i     i      i     i

OK, here's a version without .items():


def what(x):
    if type(x[0]) == np.float64:
        if np.ushort(x[0]) > 10:
            return np.ushort(x[0])
        else:
            return namednumber2numbername[x.index[0][1]][np.ushort(x[0])]

df.groupby(level=[0,1]).transform(what)

            0     1      2      3
bar one  zero   one   zero   zero
    two     i    ii  65535      i
baz one  zero  zero  65535   zero
    two     i     i      i      i
foo one  zero   one   zero   zero
    two     i     i      i      i
qux one   two  zero   zero  65534
    two     i     i      i     ii

And a one-liner, with no .items, per your request! We group by levels 0 and 1 and then perform the calculations to determine the values:

df.groupby(level=[0,1]).transform(lambda x: np.ushort(x[0]) if type(x[0]) == np.float64 and np.ushort(x[0]) > 10 else namednumber2numbername[x.index[0][1]][np.ushort(x[0])])

            0     1      2      3
bar one  zero   one   zero   zero
    two     i    ii  65535      i
baz one  zero  zero  65535   zero
    two     i     i      i      i
foo one  zero   one   zero   zero
    two     i     i      i      i
qux one   two  zero   zero  65534
    two     i     i      i     ii

To get the other values I did this:

df.transform(lambda x: [str(x.name[0]) + '_' + str(x.name[1]) + '_' + str(pos) + '_' + str(value) for pos, value in x.items()])

print('Transformed DataFrame:\n',
      df.transform(lambda x: [str(x.name[0]) + '_' + str(x.name[1]) + '_' + str(pos) + '_' + str(value)
                              for pos, value in x.items()]), sep='')

Transformed DataFrame:
                             α                                                        ...                          ω                                                       ε
f                            a                          b                          c  ...                          b                           c                           j
one  α_a_one_79.96465755359696  α_b_one_31.32938096131651   α_c_one_2.61444370203201  ...   ω_b_one_35.7457972161041  ω_c_one_40.224465043054195  ε_j_one_43.527184108357496
two  α_a_two_42.66244395377804  α_b_two_65.92020941618344  α_c_two_77.26467264185487  ...  ω_b_two_40.91908469505522  ω_c_two_50.395561828234555   ε_j_two_71.67418483119914
one   α_a_one_47.9769845681328  α_b_one_38.90671671550259  α_c_one_67.13601594352508  ...  ω_b_one_23.23799084164898  ω_c_one_63.551178212994465  ε_j_one_16.975582723809303

Here's one without .items:

df.transform(lambda x: ['_'.join((x.name[0], x.name[1], x.index[0], str(i) if type(i) == float else '0')) for i in list(x)])

output

                             α                                                        ...                          ω                                                       ε
f                            a                          b                          c  ...                          b                           c                           j
one  α_a_one_79.96465755359696  α_b_one_31.32938096131651   α_c_one_2.61444370203201  ...   ω_b_one_35.7457972161041  ω_c_one_40.224465043054195  ε_j_one_43.527184108357496
two  α_a_two_42.66244395377804  α_b_two_65.92020941618344  α_c_two_77.26467264185487  ...  ω_b_two_40.91908469505522  ω_c_two_50.395561828234555   ε_j_two_71.67418483119914
one   α_a_one_47.9769845681328  α_b_one_38.90671671550259  α_c_one_67.13601594352508  ...  ω_b_one_23.23799084164898  ω_c_one_63.551178212994465  ε_j_one_16.975582723809303

I also did it with no groupings:

df.T.apply(lambda x: x.name[0] + '_' + x.name[1] + '_' + df.T.eq(x).columns + '_' + x.astype(str), axis=1).T

or, even better and simpler:

df.T.apply(lambda x: x.name[0] + '_' + x.name[1] + '_' + x.index + '_' + x.astype(str), axis=1).T

or 

df.T.transform(lambda x: x.name[0] + '_' + x.name[1] + '_' + x.index + '_' + x.astype(str), axis=1).T

or with no .T:

df.transform(lambda x: x.index[0][0] + '_' + x.index[0][1] + '_' + x.name + '_' + x.astype(str), axis=1)

                             α                                                        ...                          ω                                                       ε
f                            a                          b                          c  ...                          b                           c                           j
one  α_a_one_79.96465755359696  α_b_one_31.32938096131651   α_c_one_2.61444370203201  ...   ω_b_one_35.7457972161041  ω_c_one_40.224465043054195  ε_j_one_43.527184108357496
two  α_a_two_42.66244395377804  α_b_two_65.92020941618344  α_c_two_77.26467264185487  ...  ω_b_two_40.91908469505522  ω_c_two_50.395561828234555   ε_j_two_71.67418483119914
one   α_a_one_47.9769845681328  α_b_one_38.90671671550259  α_c_one_67.13601594352508  ...  ω_b_one_23.23799084164898  ω_c_one_63.551178212994465  ε_j_one_16.975582723809303
oppressionslayer
  • Thanks… but wouldn't that be really inefficient and anti-numpy/anti-pandas in its implementation? - I'm talking about `values` as a simple Python list, and appending it. Sure, it could trivially be rewritten as a comprehension, and if left as a generator could be given as input to a numpy array. But is that not still really 'wrong' pandas/numpy code? - The `oneortwo` and the `for value in x` would keep things thoroughly sequential and non parallelizable. – A T Dec 22 '19 at 09:56
  • For reference, here's a version of yours I just wrote into a list comprehension: https://ideone.com/b5iQT7 – A T Dec 22 '19 at 10:16
  • @AT one liner! i'm sure it can be edited to deal with any other issues you may have. But that's great right! – oppressionslayer Dec 22 '19 at 10:56
  • Thanks, good to see a one liner! - Any change that can be rewritten without an `items()` as per my original question? – A T Dec 22 '19 at 11:00
  • 2
    I've spent a lot of time on this so my brain hurts, lol, but maybe, let me think about it. – oppressionslayer Dec 22 '19 at 11:03
  • @AT Check the newest one out, i'm sure it can be made into a one liner, but really, its going to be the same speed. But check it out – oppressionslayer Dec 22 '19 at 12:05
  • Thanks good to see. Unfortunately I can't figure it out on my actual shape and naming conventions—including MultiIndex hierarchies—of my `DataFrame`: https://ideone.com/nhzK5F. I tried to simplify it in my StackOverflow question, but I didn't realise that in simplifying it the answers wouldn't solve my actual problem :\ - Please see if you can tell me how to get where in the DataFrame the cell the cursor is currently on. I need to know both it's row label (`"one"` or `"two"`) as well as its column labels (both the greek letter and the latin letter). – A T Dec 24 '19 at 01:51
  • @AT Are you available for chat, i got this working with the new format, but i want to touch base with you so i can add a solution that we can add to the answer – oppressionslayer Dec 24 '19 at 04:51
  • For example, this works on your new data: df.transform(lambda x: [ np.ushort(value) if np.ushort(value) > 9 else namednumber2numbername[pos][np.ushort(value)] for pos, value in x.items()]) – oppressionslayer Dec 24 '19 at 05:05
  • Yes I'm available to chat (here or as `AlecTaylor` on Freenode) – A T Dec 24 '19 at 05:07
  • @AT I created this room: https://chat.stackoverflow.com/rooms/204764/axandoppressionslayer My name is Will btw I just created a freenode, i'm oppressionslayer – oppressionslayer Dec 24 '19 at 05:13
  • I would replace your latest answer with `df.transform(lambda x: ['_'.join((x.name[0], x.name[1], pos, str(value)) for pos,value in x.items()])`. But any way of doing that without the `.items()` call? – A T Dec 24 '19 at 06:26
  • Looking good! - FYI: A clean simplification of the casting and concatenation is `.transform(lambda x: tuple('_'.join((x.name[0], x.name[1], x.index[0], str(i if type(i) == float else 0))) for i in x))`. - I'm going to hold off accepting immediately, in case there is a trick like @Stef used with the `args=` thus avoiding a `groupby`. – A T Dec 24 '19 at 09:22
  • 1
    I added the join. I will look for a solution like that too if you don't mind. Also, one thing i noticed about that solution is that it is very cumbersome, try adding prints in there, it reloops many times, but maybe i can figure out a way so that it loops once – oppressionslayer Dec 24 '19 at 09:22
  • @AT Think i did it, no groupings, straight apply or transform – oppressionslayer Dec 24 '19 at 11:49
  • @AT check the top of my answer, i wrote the fastest transform that i've seen here to accomplish your task. it's a one liner and i used a very large data set to test it on, you can test the others as well, but i think due to me skipping a second apply, it's the fastest ( and i tested to confirm ) – oppressionslayer Dec 25 '19 at 02:44
  • Nicely done! - Accepting and bountifying! - I did make a small edit though: `df.transform(lambda x: '_'.join(map(str, (x.index[0][0], x.index[0][1], x.name))) + '_' + x.astype(str), axis=1)` – A T Dec 25 '19 at 06:18
  • 1
    Thanks!!! if you need anything else, let me know, i've had fun working on this one. I wish there were more like it. – oppressionslayer Dec 25 '19 at 06:20
  • Thanks, yeah I'm currently debugging an issue with my real dataset (`x.index` is the same for each column… but correctly different for each row) – A T Dec 25 '19 at 11:22
  • Got it to work with `df.T.apply(lambda x: '_'.join(x.name + (str(x.index[0]),)) + '_' + x.astype(str), axis=1).T` – A T Dec 25 '19 at 12:33
  • Got it to work with `df.T.apply(lambda x: '::'.join(x.name + (str(x.index[0]),)) + '::' + x.astype(str), axis=1).T.apply(lambda series: series.apply(h, args=((series.name),)), axis=1)`, inside `def h` I split on `::` to parse out the cell. Weird that I couldn't get the syntax to work directly (the last arg always ended up a `pd.Series` rather than a `str` when done that way). – A T Dec 25 '19 at 14:04
  • Scratch that, ended up using `df.transform(lambda x: ['_'.join((x.name[0], x.name[1], str(pos), str(value))) for pos, value in x.items()])`, everything else repeated one of the arguments on every cell (I checked). – A T Dec 26 '19 at 01:50
3

By default, transform applies the function to each column. You can instead apply it to each row by specifying the axis parameter as 1 or 'columns'. Then you have access to the row index and can pass its second name field on to your function:

def namednumber2numbername_applicator(series):
    def to_s(value, name):
        if pd.isnull(value): return value
        value = np.ushort(value)
        if value > 10: return value

        return namednumber2numbername[name][value]

    return series.apply(to_s, args=(series.name[1],))

df.transform(namednumber2numbername_applicator, 1)

Result:

             0      1      2      3
bar one  65535   zero   zero  65535
    two     ii      i    iii  65535
baz one  65535   zero   zero  65535
    two      i      i  65535      i
foo one   zero   zero   zero   zero
    two      i  65535      i      i
qux one   zero   zero   zero  65535
    two      i      i      i      i
Stef
  • Thanks, that's looking clean. I've applied it to my real dataset, but am now confused at how to get, for—e.g.: coordinates baz, two, 1 (`i` value)—all three identifiers (`baz`, `two`, `1`). How do I get these? – A T Dec 23 '19 at 11:42
  • IIUC you need `name[0]` (`baz`), as you already have `name[1]` (`two`) and the value (`1`): just use `return series.apply(to_s, args=((series.name),))`, then you can access `name[0]` and `name[1]` inside your `to_s` function (see the sketch after these comments). If this is not what you need please rephrase your question. – Stef Dec 23 '19 at 11:50
  • Hi @Stef - here is an example with the exact shape and naming conventions—including MultiIndex hierarchies—of my `DataFrame`: https://ideone.com/nhzK5F. I tried to simplify it in my StackOverflow question, but I didn't realise that in simplifying it the answers wouldn't solve my actual problem :\ - Please see if you can tell me how to get where in the DataFrame the cell the cursor is currently on. I need to know both it's row label (`"one"` or `"two"`) as well as its column labels (both the greek letter and the latin letter). – A T Dec 24 '19 at 01:50
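
For completeness, here is a minimal sketch of the variant described in the comment above: the same applicator, but passing the whole series.name tuple so that both index levels are available inside to_s. Names are taken from the question; this is an illustrative adaptation, not part of the original answer:

def namednumber2numbername_applicator(series):
    def to_s(value, name):  # name is the full row label, e.g. ('baz', 'two')
        if pd.isnull(value): return value
        value = np.ushort(value)
        if value > 10: return value
        return namednumber2numbername[name[1]][value]

    # pass the whole MultiIndex label instead of only its second level
    return series.apply(to_s, args=(series.name,))

df.transform(namednumber2numbername_applicator, 1)
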
2

Here is another way, using reindex and np.where():

def myf(dataframe, dictionary):
    cond1 = dataframe.isna()
    cond2 = np.ushort(dataframe) > 10
    m = (pd.DataFrame.from_dict(dictionary, orient='index')
           .reindex(dataframe.index.get_level_values(1)))
    m.index = pd.MultiIndex.from_arrays((dataframe.index.get_level_values(0), m.index))
    arr = np.where(cond1 | cond2, np.ushort(dataframe),
                   m[m.columns.intersection(dataframe.columns)])
    return pd.DataFrame(arr, dataframe.index, dataframe.columns)

myf(df, namednumber2numbername)

             0      1      2      3
bar one   zero    one    two  three
    two  65535     ii    iii  65535
baz one   zero    one  65535  three
    two      i     ii    iii     iv
foo one   zero  65535    two  three
    two      i     ii    iii     iv
qux one   zero  65535    two  65535
    two      i     ii    iii     iv

Steps followed:

  • The function creates a dataframe (m) from the dictionary and reindexes it against the second index level of the original.
  • After this, we add an extra level to make it a MultiIndex, the same as the original dataframe (print m inside the function, or see the sketch below this list, to inspect m).
  • Then we check the condition: whether the dataframe value is null or its np.ushort value is more than 10.
  • If the condition matches, return the np.ushort of the dataframe, else the values from the matching columns of m.

Let me know if there are any steps I have missed, or anything you want me to incorporate, as I feel this is one way to avoid row-wise computation.
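
For illustration, here is a rough sketch of just the lookup-frame construction in isolation (the m mentioned in the steps above), assuming the df and namednumber2numbername from the question; it is an added illustration, not part of the original answer:

# Build the lookup frame m: one row per row of df, with columns 0..9 holding
# the spelled-out names for that row's second index level ('one' or 'two').
m = (pd.DataFrame.from_dict(namednumber2numbername, orient='index')
       .reindex(df.index.get_level_values(1)))
m.index = pd.MultiIndex.from_arrays((df.index.get_level_values(0), m.index))
print(m[[0, 1, 2, 3]])  # the columns np.where() will pick for this df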

anky
1

An example using Series.map:

class dict_default_key(dict):
    # return the key itself for missing keys (so values > 9 pass through unchanged)
    def __missing__(self, key):
        return key


number_names = [
    'zero',
    'one',
    'two',
    'three',
    'four',
    'five',
    'six',
    'seven',
    'eight',
    'nine'
]
roman_numerals = [
    'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x'
]
name_mapping = {
    'one': dict_default_key(
        {c: v for c, v in enumerate(number_names)}
    ),
    'two': dict_default_key(
        {c: v for c, v in enumerate(roman_numerals)}
    )
}

def translate(series):
    key = series.name[1]           # second level of the row label: 'one' or 'two'
    row_map = name_mapping[key]
    result = series.map(row_map)
    return result

ushorts = df.apply(np.ushort)     # convert the floats to unsigned shorts first
ushorts.apply(translate, axis=1)  # row-wise, so series.name is the row's MultiIndex label
Meow
  • 1
    Thanks, that works. I've applied it to my real dataset, but am now confused at how to get, for—e.g.: coordinates `baz`, `two`, `1` (`i` value)—all three identifiers (`baz`, `two`, `1`). How do I get these? – A T Dec 23 '19 at 12:36
0

Here's how I would go about solving this:

# 1. Rewrite functions to include a parameter for `idx`
def some_fun_name(value, idx):  
    value = np.ushort(value)
    if value > 10: 
        return value
    else:
        return namednumber2numbername[idx][value]

def apply_some_fun_name(s):
    idx = list(s.index.get_level_values(1).unique())[0]
    return s.transform(some_fun_name, idx=idx)

# 2. Apply function over the keys of the multi-index, replacing while operating:
df = df.groupby(level=1).transform(apply_some_fun_name)

# 3. I got the following result while using `np.random.seed(1)`:
             0      1     2      3
bar one    one   zero  zero  65535
    two      i  65534    ii      i
baz one   zero   zero   one  65534
    two      i      i    ii  65535
foo one   zero   zero  zero   zero
    two  65535     ii     i      i
qux one   zero   zero  zero   zero
    two      i      i     i      i

Yaakov Bressler