This is just a nitpicking syntactic question...

I have a dataframe, and I want to use a list comprehension to evaluate a function over many columns.

I know I can do this

df['result_col'] = [some_func(*var) for var in zip(df['col_1'], df['col_2'],... ,df['col_n'])]

I would like to do something like this

df['result_col'] = [some_func(*var) for var in zip(df[['col_1', 'col_2',... ,'col_n']])]

That is, I want to avoid writing df n times. I cannot for the life of me figure out the syntax.
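For context, a small sketch of why the second form does not behave like the first. The three-column frame and the `cols` list are made up for illustration; the point is only that iterating a DataFrame walks over its column labels, not its rows.

```python
import pandas as pd

# Hypothetical three-column frame; the names and values are illustrative only.
df = pd.DataFrame({"col_1": [1, 2], "col_2": [10, 20], "col_3": [100, 200]})
cols = ["col_1", "col_2", "col_3"]

# Iterating a DataFrame yields its column labels, so zip() just wraps each
# label in a 1-tuple instead of producing one tuple per row.
labels = list(zip(df[cols]))

# Unpacking one Series per column into zip() gives the row-wise tuples.
rows = list(zip(*[df[c] for c in cols]))

print(labels)
print(rows)
```

So `some_func(*var)` in the second form would receive a single column name rather than the row values.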

aynber
mortysporty
  • try this `df['result_col'] = [some_func(*var) for var in zip(*df[col for col in ['col_1', 'col_2',... ,'col_n']])]`? – deadvoid Oct 02 '18 at 12:07
  • Why don't you just use apply: `df['result_col'] = df.apply(lambda x: some_func(*tuple(x.values)), axis=1)` ? – gyx-hh Oct 02 '18 at 12:10
  • @cryptonome seems to be a syntax error somewhere.. missing a bracket or parenthesis? – mortysporty Oct 02 '18 at 12:16
  • @gyx-hh i thought apply was slow. But honestly I didn't even consider it – mortysporty Oct 02 '18 at 12:17
  • @cryptonome `df['result_col'] = [some_func(*var) for var in zip(*[df[col] for col in ['col_1', 'col_2',... ,'col_n']])]` worked. thanks. If you want credit for your answer, post it and I'll tag it as the answer. – mortysporty Oct 02 '18 at 12:23
  • @mortysporty perhaps, i didn't try to check the amount of brackets/parens in it. alright then, i'll post an answer, and check the typo first :) – deadvoid Oct 02 '18 at 12:28
  • @mortysporty when you apply you're just looping through the dataframe just like what you're doing with list comprehension. – gyx-hh Oct 02 '18 at 12:33
  • Possible duplicate of https://stackoverflow.com/questions/40646458/list-comprehension-in-pandas/62062095#62062095 and https://stackoverflow.com/questions/58567199/memory-efficient-way-for-list-comprehension-of-pandas-dataframe-using-multiple-c/62064720#62064720 (though the latter link is younger, so only now it has become a possible duplicate) – questionto42 May 28 '20 at 13:10
  • @gyx-hh df.apply(), df.itertuples(), df.iteritems(), df.iterrows() are much slower than list comprehension and not recommended. Your comment is wrong: apply and list comprehension are not at all equal in speed – questionto42 Jun 25 '20 at 11:36
  • Thanks for that @Lorenz, was not aware of that. I'm aware df.apply and those methods are slow in general because we are looping, and you should always try to find a different approach that is vectorised, but was not aware list comprehension is faster than df.apply – gyx-hh Jun 26 '20 at 12:18
  • Does this answer your question? [list comprehension in pandas](https://stackoverflow.com/questions/40646458/list-comprehension-in-pandas) – questionto42 Jul 08 '22 at 17:44

3 Answers

This should work, though honestly the OP figured it out himself as well, so +1 to the OP :)

df['result_col'] = [some_func(*var) for var in zip(*[df[col] for col in ['col_1', 'col_2',... ,'col_n']])]
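A runnable version of the same idea, as a sketch: `some_func`, the column names, and the data are placeholders, not anything from the question. Keeping the names in a `cols` list means `df` appears only once inside the comprehension. The `itertuples(index=False, name=None)` line is an alternative spelling that yields plain row tuples; no claim is made here about which one is faster.

```python
import pandas as pd


def some_func(a, b, c):
    # Hypothetical stand-in for the real row-wise function.
    return a + b * c


df = pd.DataFrame({"col_1": [1, 2, 3], "col_2": [4, 5, 6], "col_3": [7, 8, 9]})
cols = ["col_1", "col_2", "col_3"]

# One Series per column, unpacked into zip() to get one tuple per row.
df["result_col"] = [some_func(*var) for var in zip(*[df[col] for col in cols])]

# Alternative: plain tuples per row, without the index.
via_itertuples = [
    some_func(*var) for var in df[cols].itertuples(index=False, name=None)
]

print(df["result_col"].tolist())
print(via_itertuples)
```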
deadvoid

As mentioned in the comments above, you should use apply instead:

df['result_col'] = df.apply(lambda x: some_func(*tuple(x.values)), axis=1)
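One caveat, sketched below with made-up data: `df.apply(..., axis=1)` on the whole frame hands every column to the function, so if only some columns are arguments it helps to select them first. `some_func`, `cols`, and the `other` column are assumptions for the example.

```python
import pandas as pd


def some_func(a, b):
    # Hypothetical stand-in for the real row-wise function.
    return a * b


df = pd.DataFrame(
    {"col_1": [1, 2, 3], "col_2": [4, 5, 6], "other": ["x", "y", "z"]}
)
cols = ["col_1", "col_2"]

# Restrict to the argument columns, then unpack each row Series into the call.
df["result_col"] = df[cols].apply(lambda row: some_func(*row), axis=1)

print(df["result_col"].tolist())
```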
gyx-hh

df.apply() is almost as slow as df.iterrows(), and neither is recommended; see "How to iterate over rows in a DataFrame in Pandas", search for "An Obvious Example" in @cs95's answer, and look at the comparison graph. Because the fastest approaches (vectorization, Cython routines) are not easy to implement, the third-best and therefore usually most practical solution is a list comprehension:

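# Print the 3rd selected column. zip() over the 2-D array yields 1-tuples,
# so *row unpacks to a single argument: one NumPy row.
# Note: some_func only prints, so result_col ends up filled with None.
def some_func(row):
    print(row[2])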


df['result_col'] = [some_func(*row) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]

or

# Print the 3rd selected column; row[0] is the NumPy row inside the 1-tuple.
def some_func(row):
    print(row[2])

df['result_col'] = [some_func(row[0]) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]

or

# Print the 3rd selected column; row[0][2] is the 3rd value of the NumPy row.
def some_func(x):
    print(x)

df['result_col'] = [some_func(row[0][2]) for row in zip(df[['col_1', 'col_2',... ,'col_n']].to_numpy())]
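The three variants above differ only in how they undo the 1-tuple that `zip()` wraps around each row. A simpler sketch, with an invented frame and a `some_func` that returns instead of printing, iterates the array directly:

```python
import pandas as pd


def some_func(row):
    # Hypothetical stand-in: return the 3rd selected column of the row.
    return row[2]


df = pd.DataFrame({"col_1": [1, 2], "col_2": [3, 4], "col_3": [5, 6]})
cols = ["col_1", "col_2", "col_3"]

arr = df[cols].to_numpy()

# zip() with a single iterable wraps each row in a 1-tuple.
wrapped = list(zip(arr))

# Iterating the 2-D array directly already yields one row per step.
df["result_col"] = [some_func(row) for row in arr]

print(len(wrapped[0]))
print(df["result_col"].tolist())
```

If the function takes one positional argument per column, `some_func(*row)` over `arr` would work the same way.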

EDIT:

Please use df.iloc and df.loc instead of df[[...]], see Selecting multiple columns in a Pandas dataframe
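As a sketch of what that selection could look like (the frame, the `other` column, and `some_func` are invented for the example): `.loc` slices by label with both ends included, `.iloc` slices by position with the end excluded.

```python
import pandas as pd


def some_func(a, b, c):
    # Hypothetical stand-in for the real row-wise function.
    return a + b + c


df = pd.DataFrame(
    {"col_1": [1, 2], "col_2": [10, 20], "col_3": [100, 200], "other": [0, 0]}
)

by_label = df.loc[:, "col_1":"col_3"]  # label slice, both ends included
by_position = df.iloc[:, 0:3]          # positional slice, end excluded

df["result_col"] = [some_func(*row) for row in by_label.to_numpy()]

print(by_label.equals(by_position))
print(df["result_col"].tolist())
```

The label slice only helps when the wanted columns are adjacent; otherwise a list of names is still needed.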

questionto42