
I want to run a function on the rows of a pandas dataframe in a list comprehension. The dataframe can have a varying number of columns. How can I make use of these columns of the dataframe?

import pandas as pd

df = {'chrom': ['chr1', 'chr1', 'chr1'], 'start': [10000, 10100, 12000], 'end': [10150, 10120, 12250], 'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3]}
df = pd.DataFrame(data=df)
print(df)

def func(row):
    print(row)


[func(row) for row in zip(df['chrom'], df['start'], df['S1'], df['S2'], df['S3'])]

How to do this in a memory efficient way? So that we do not get any memory error for big dataframes.

questionto42
burcak
  • Depends on how you want the output. Seems like the way you're doing it would be among the most efficient possible ways (since `zip()` produces a generator, effectively) - is there a particular problem you're running into here? – Green Cloak Guy Oct 26 '19 at 01:51
  • Yes, the number of columns starting with 'S' is not constant. There can be 30 columns starting from 'S1' ... to 'S30' or 60 columns starting from 'S1' ... to 'S60'. – burcak Oct 26 '19 at 02:33
  • Since number of columns is a variable, I used df[list(df.columns.values)].values() but this gives MemoryError – burcak Oct 26 '19 at 02:34
  • Also using df[list(df.columns.values)].to_numpy(copy=False) gives MemoryError – burcak Oct 26 '19 at 02:36
  • Possible duplicate of https://stackoverflow.com/questions/52607864/pandas-list-comprehension-tuple-from-dataframe/62064822#62064822 and https://stackoverflow.com/questions/40646458/list-comprehension-in-pandas/62062095#62062095 – questionto42 May 28 '20 at 13:07
  • Does this answer your question? [list comprehension in pandas](https://stackoverflow.com/questions/40646458/list-comprehension-in-pandas) – questionto42 May 28 '20 at 13:19
  • Does this answer your question? [What is the most efficient way to loop through dataframes with pandas?](https://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas) – questionto42 Jul 08 '22 at 18:06

4 Answers


The code you show is already very memory efficient, and should be faster than an `iterrows()`-based solution.

But from your comments, it is not this code that causes the memory error. The problematic snippets are:

df[list(df.columns.values)].values()

or:

df[list(df.columns.values)].to_numpy(copy=False)

because both involve a full copy of the dataframe values unless all columns have the same dtype.
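A quick way to see that copy, sketched with NumPy's `shares_memory` check (the small frame here is just for illustration):

```python
import numpy as np
import pandas as pd

# a mixed-dtype frame: the int64 and float64 columns live in separate blocks
df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})

# to_numpy() must upcast everything to one common dtype, so even with
# copy=False it cannot reuse the column buffers and has to return a copy
arr = df.to_numpy(copy=False)
print(np.shares_memory(arr, df['a'].values))   # False
```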

If you want to process an unknown number of columns, the safe way is:

[func(row) for row in zip(*(df[i].values for i in df.columns))]

No copy is required here: df[i].values returns each column's underlying numpy array, and the * unpacks the columns so that zip yields one tuple per row. (Note the *; without it, zip would iterate over a single sequence and yield one-element tuples containing whole columns.)
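Put together with the question's sample data, the row iteration looks like this (a sketch where the tuples are simply collected instead of passed to `func`):

```python
import pandas as pd

df = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr1'],
                   'start': [10000, 10100, 12000],
                   'end': [10150, 10120, 12250],
                   'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3]})

# df[i].values is a view into each column's buffer; zip(*...) transposes
# the columns into row tuples lazily, so no full-frame copy is made
rows = [row for row in zip(*(df[i].values for i in df.columns))]

print(len(rows))   # 3
# rows[0] == ('chr1', 10000, 10150, 1, 2, 3)
```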


By the way, if you only need to use the values of the returned list once, you could even save some memory by using a generator instead of a list:

(func(row) for row in zip(*(df[i].values for i in df.columns)))
Serge Ballesta

Thanks for your answers.

Meantime, I found the following as a solution:

df_columns = list(df.columns.values)
[func_using_list_comp(
                row,
                var1,
                var2,
                var3,
                ...,
                df_columns) for row in df[df_columns].values]

This way I did not need to use the zip function, and it works for any number of columns.

I hope this is also memory efficient. By the way, I'm accumulating into var1, var2, var3 each time I process a row.

If I use a generator instead of a list, how much will it affect my memory usage, and will I still get all the accumulated data after processing all rows?

I'm asking because I return these var1, var2, var3 after all rows are processed.
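A small sketch suggests the answer: a generator runs lazily, so the accumulated variables are only complete after the generator has been fully consumed. Here `func_using_list_comp` is a hypothetical accumulator summing the S columns into a dict that stands in for var1, var2, var3:

```python
import pandas as pd

df = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr1'],
                   'start': [10000, 10100, 12000],
                   'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3]})

totals = {'S1': 0, 'S2': 0, 'S3': 0}   # stands in for var1, var2, var3

def func_using_list_comp(row, totals, df_columns):
    # hypothetical accumulator: add every S-column value into totals
    for name, value in zip(df_columns, row):
        if name in totals:
            totals[name] += value

df_columns = list(df.columns.values)
gen = (func_using_list_comp(row, totals, df_columns)
       for row in df[df_columns].values)
for _ in gen:       # nothing runs until the generator is consumed
    pass

print(totals)       # totals now holds the accumulated sums
```

The generator version avoids materializing the list of (here useless) return values, but the accumulation into totals is identical once the loop has drained it.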

burcak

Your list comprehension method seems a bit more confusing than it needs to be, especially considering pandas dataframes have an iterrows() method. You can replace your version with this:

for index, row in df.iterrows():
    func(row)

But I only suggest the above method because your function seems to only print out the row. Depending on what your func really does, you may want to consider using df.apply():

df.apply(func, axis=1)
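For instance, with the sample frame from the question, a func that combines columns by name works the same however many columns there are (the end - start difference here is just an illustrative stand-in for your real func):

```python
import pandas as pd

df = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr1'],
                   'start': [10000, 10100, 12000],
                   'end': [10150, 10120, 12250],
                   'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3]})

# axis=1 hands each row to the function as a Series, so columns can be
# addressed by name no matter how many there are
widths = df.apply(lambda row: row['end'] - row['start'], axis=1)
print(widths.tolist())   # [150, 20, 250]
```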
anair
    List comprehension is faster than apply, therefore I want to use list comprehension. – burcak Oct 26 '19 at 02:58
  • The post here would disagree with your belief that list comprehensions are always faster than an apply/map. https://stackoverflow.com/a/40057151/12274459 If your function can be vectorized, it will almost certainly be faster in an apply/map. Another piece of evidence against list comprehensions being faster than apply/map: https://stackoverflow.com/a/43677631/12274459 – anair Oct 26 '19 at 03:29
    unfortunately, it can not be vectorized. Have a look. https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas – burcak Oct 26 '19 at 03:43

In your example, to print the full row, the [0] or * is simply there to unwrap the one-element tuple that zip() puts around each numpy row:

[func(*row) for row in zip(df[['chrom','start','S1','S2','S3']].to_numpy())]

or

[func(row[0]) for row in zip(df[['chrom','start','S1','S2','S3']].to_numpy())]

['chr1' 10000 1 2 3]
['chr1' 10100 1 2 3]
['chr1' 12000 1 2 3]

Printing only the third column:

[func(row[0][2]) for row in zip(df[['chrom','start','S1','S2','S3']].to_numpy())]

1
1
1

P.S.: this also shows [None, None, None] at the end of the console output, but that is just the list comprehension collecting the return value of print(), which is None; it does not belong to the printed rows.
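For what it's worth, the zip() wrapper is not strictly needed here: iterating a 2-D numpy array already yields its rows one by one, so the same output can be had without the [0]/* unwrapping (a sketch collecting the rows instead of printing them):

```python
import pandas as pd

df = pd.DataFrame({'chrom': ['chr1', 'chr1', 'chr1'],
                   'start': [10000, 10100, 12000],
                   'S1': [1, 1, 1], 'S2': [2, 2, 2], 'S3': [3, 3, 3]})

# iterating a 2-D numpy array yields one row per step, so no zip() and
# no tuple unwrapping are needed
rows = [row for row in df[['chrom', 'start', 'S1', 'S2', 'S3']].to_numpy()]

print(len(rows))      # 3
print(rows[0][2])     # 1  (the third column, S1, of the first row)
```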

EDIT:

Please use df.iloc and df.loc instead of df[[...]], see Selecting multiple columns in a Pandas dataframe
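A minimal sketch of the label- and position-based selection that link recommends:

```python
import pandas as pd

df = pd.DataFrame({'chrom': ['chr1', 'chr1'], 'start': [10000, 10100],
                   'S1': [1, 1], 'S2': [2, 2], 'S3': [3, 3]})

# df.loc selects by label: all rows, only the named columns
sub = df.loc[:, ['chrom', 'start', 'S1']]
print(list(sub.columns))   # ['chrom', 'start', 'S1']

# df.iloc selects by position: all rows, the first three columns
sub2 = df.iloc[:, :3]
print(sub2.shape)          # (2, 3)
```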

questionto42