2

I am trying to generate a new column in a pandas dataframe by loop over >100,000 rows and setting the value of the row conditional on an already existing row.

The current dataframe is a dummy but works as an example. My current code is:

df=pd.DataFrame({'IT100':[5,5,-0.001371,0.0002095,-5,0,-5,5,5],
'ET110':[0.008187884,0.008285232,0.00838258,0.008479928,1,1,1,1,1]})

# if charging set to 1, if discharging set to -1.
# if -1 < IT100 < 1 then set CD to previous cells value
# Charging is defined as IT100 > 1 and Discharge is defined as IT100 < -1


 def CD(dataFrame):


    for x in range(0,len(dataFrame.index)):
     
        current = dataFrame.loc[x,"IT100"]

        if x == 0:
            if dataFrame.loc[x+5,"IT100"] > -1:
                dataFrame.loc[x,"CD"] = 1
            else:
                dataFrame.loc[x,"CD"] = -1
        else:
            if current > 1:
                dataFrame.loc[x,"CD"] = 1
            elif current < -1:
                dataFrame.loc[x,"CD"] = -1
            else:
                dataFrame.loc[x,"CD"] = dataFrame.loc[x-1,"CD"]

Using if/Else loops is extremely slow. I see that people have suggested to use np.select() or pd.apply(), but I do not know if this will work for my example. I need to be able to index the column because one of my conditions is to set the value of the new column to the value of the previous cell in the column of interest.

Thanks for any help!

NKUP
  • 21
  • 3
  • 2
    It's not the `if/else` part that's slow. Just FYI – Grajdeanu Alex Dec 28 '20 at 23:23
  • Please post an example dataframe, just the columns of interest and, say, a dozen rows. – tdelaney Dec 28 '20 at 23:31
  • 2
    Have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide a [mcve] so that we cna better understand how to help – G. Anderson Dec 28 '20 at 23:32
  • Any time you explicitly iterate through your data frame rows, you're slowing down the processing. Please work through more PANDAS tutorials to get familiar with vectored operation. In this case, you can use the `shift` method to handle the `x-1` indexing as a vectored operation. – Prune Dec 28 '20 at 23:33
  • In general, pandas doesn't have a way to handle recursive definitions, but in this case as you're not really doing any modifications on the previous values, `ffill` works here – Asish M. Dec 28 '20 at 23:46
  • @prune do you have any suggestions? I was trying to work throuhg some but could not figure out what to search. I was going through some using lambda functions, others with np.select() but could not reference specific cell indexes with those. If you have any ideas I will definitely dive into them – NKUP Dec 29 '20 at 00:04
  • @G.Anderson Thanks for the note! I added some additional data that will hopefully help. My current code works but it just takes a long time when the number of rows increases to the size of my actual data set – NKUP Dec 29 '20 at 00:09

3 Answers3

3

@Grajdeanu Alex is right, the loop is slowing you down more than whatever you're doing inside of it. With pandas, a loop is usually the slowest choice. Try this:

import pandas as pd
import numpy as np
df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
df['CD'] = np.nan
#lower saturation
df.loc[df['IT100'] < -1,['CD']] = -1
#upper saturation
df.loc[df['IT100'] > 1,['CD']] = 1
#fill forward
df['CD'] = df['CD'].ffill()
# setting the first row equal to the fifth
df.loc[0,['CD']] = df.loc[5,['CD']]

using ffill will use the last valid value to fill in subsequent nan values (-1 < x < 1)

EMiller
  • 817
  • 1
  • 7
  • 20
  • the OP's dataframe has over 100k rows, so I think you're misinterpreting the significance of the `x+5` – Paul H Dec 28 '20 at 23:55
  • 1
    unless they miswrote - their code only looks at `x+5` if `x==0` (x being the index here which "should" be unique) – Asish M. Dec 28 '20 at 23:57
  • For the X==0 cell, I wanted to check what the value was for the 5th or so cell because I knew my signal would be stable by then. – NKUP Dec 28 '20 at 23:59
0

Similar to EMiller's answer, you could also use clip.

import pandas as pd
import numpy as np
df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})

df['CD'] = df['IT100'].clip(-1, 1)
df.loc[~df['CD'].isin([-1, 1]), 'CD'] = np.nan
df['CD'] = df['CD'].ffill()
df.loc[0,['CD']] = df.loc[5,['CD']]
Jack Moody
  • 1,590
  • 3
  • 21
  • 38
0

As an alternate to @EMiller's answer

In [213]: df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})

In [214]: df
Out[214]:
   IT100
0   0.00
1 -50.00
2 -20.00
3  -0.50
4  -0.25
5  -0.50
6 -10.00
7   5.00
8   0.50

In [215]: df['CD'] = pd.Series(np.where(df['IT100'].between(-1, 1), np.nan, df['IT100'].clip(-1, 1))).ffill()


In [217]: df.loc[0, 'CD'] = 1 if df.loc[5, 'IT100'] > -1 else -1

In [218]: df
Out[218]:
   IT100   CD
0   0.00  1.0
1 -50.00 -1.0
2 -20.00 -1.0
3  -0.50 -1.0
4  -0.25 -1.0
5  -0.50 -1.0
6 -10.00 -1.0
7   5.00  1.0
8   0.50  1.0
Asish M.
  • 2,588
  • 1
  • 16
  • 31