How to avoid using a loop in pandas

Question

import numpy as np
import pandas as pd
from datetime import datetime

today = pd.Timestamp.today()
for x in range(len(df)):
    # trace back: 2022 date missing, 2021 date present
    if df.loc[x, 'date_2022'] is pd.NaT and df.loc[x, 'date_2021'] is not pd.NaT:
        # extract month and day
        d1 = today.strftime('%m-%d')
        d2 = df.loc[x, 'date_2021'].strftime('%m-%d')

        # convert to datetime
        d1 = datetime.strptime(d1, '%m-%d')
        d2 = datetime.strptime(d2, '%m-%d')

        # get difference in days
        diff = d1 - d2
        days = diff.days
        # range 14 days
        if days > 14:
            df.loc[x, 'inspection'] = 'check'
        else:
            df.loc[x, 'inspection'] = np.nan

My aim is to add an inspection column. The condition is: the 2022 cell is null (pd.NaT), last year's cell is not null, and more than 14 days have passed since last year's date (comparing month and day only). How can I write this without using a loop?

Do you really have two dataframes (`df` and `esgReport2021`), or is it just a *typo*? Either way, can you make a [minimal-reproducible-example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and show your expected output? – Timeless, May 15 '23 at 06:53

jezrael · Answer 1 · 2023-05-15T07:05:34.150

1

Use Timestamp.strftime and Series.dt.strftime with to_datetime to get datetimes that keep only the month and day. To test for missing values, use Series.isna; chain it with a condition that tests the difference in days via Series.dt.days, and create the new column with numpy.where:

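import numpy as np
import pandas as pd

# Keep only month and day, then re-parse so both sides share the same default year
d1 = pd.to_datetime(pd.Timestamp.today().strftime('%m-%d'), format='%m-%d')
d2 = pd.to_datetime(df['date_2021'].dt.strftime('%m-%d'), format='%m-%d')

m = df['date_2022'].isna()  # True where the 2022 date is missing

# None instead of np.nan: mixing 'check' with np.nan makes np.where return the string 'nan'
df['inspection'] = np.where(((d1 - d2).dt.days > 14) & m, 'check', None)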
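
For anyone who wants to run this end to end, here is a minimal sketch of the same approach wrapped in a function. The sample frame, the fixed `today` value and the name `flag_inspection` are assumptions made for illustration; they are not from the question. The fallback value is None rather than np.nan, since np.where with 'check' and np.nan produces a string array in which the missing value shows up as the text 'nan'.

```python
import numpy as np
import pandas as pd


def flag_inspection(df: pd.DataFrame, today: pd.Timestamp) -> pd.Series:
    """Return 'check' where date_2022 is missing and today's month-day is
    more than 14 days past date_2021's month-day; missing elsewhere.

    Hypothetical helper, not part of pandas.
    """
    # Reduce both sides to month-day, then re-parse so they share one year.
    d1 = pd.to_datetime(today.strftime('%m-%d'), format='%m-%d')
    d2 = pd.to_datetime(df['date_2021'].dt.strftime('%m-%d'), format='%m-%d')

    overdue = (d1 - d2).dt.days > 14       # a missing date_2021 compares as False
    missing_2022 = df['date_2022'].isna()

    flags = np.where(overdue & missing_2022, 'check', None)
    return pd.Series(flags, index=df.index)


# Made-up sample data, only to exercise the function.
df = pd.DataFrame({
    'date_2021': pd.to_datetime(['2021-04-01', '2021-05-10', None, '2021-03-01']),
    'date_2022': pd.to_datetime([None, None, None, '2022-03-05']),
})
df['inspection'] = flag_inspection(df, pd.Timestamp('2023-05-15'))
print(df)
```

Passing `today` in as an argument, instead of calling pd.Timestamp.today() inside the function, keeps the result reproducible when testing.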

edited May 15 '23 at 07:05

answered May 15 '23 at 06:34


1

How do I subtract two dates while ignoring the year? I use strftime to keep just the month and day of every date. – Rinne Tsujikubo May 15 '23 at 07:02
@RinneTsujikubo - I understand, the answer was edited - I now use the same approach as in your question. – jezrael May 15 '23 at 07:06
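
On subtracting dates while ignoring the year: one caveat with parsing '%m-%d' alone is that the parsed dates land in a default year (1900), which is not a leap year, so a Feb 29 value may fail to parse. A possible workaround, sketched below, is to move every date into a fixed leap year before subtracting. The helper name `days_since_month_day` and the choice of 2000 as the anchor year are assumptions for illustration only.

```python
import pandas as pd


def days_since_month_day(dates: pd.Series, ref: pd.Timestamp,
                         anchor_year: int = 2000) -> pd.Series:
    """Days from each date's month-day to ref's month-day, with both moved
    into anchor_year (a leap year, so Feb 29 stays valid).

    Hypothetical helper, not part of pandas.
    """
    # Write the anchor year literally into the format string; NaT becomes
    # NaN here and parses back to NaT.
    moved = pd.to_datetime(dates.dt.strftime(f'{anchor_year}-%m-%d'),
                           format='%Y-%m-%d')
    # Drop the time of day so the result counts whole calendar days.
    ref_moved = ref.normalize().replace(year=anchor_year)
    return (ref_moved - moved).dt.days


# Made-up dates, including a Feb 29 and a missing value.
dates = pd.Series(pd.to_datetime(['2020-02-29', '2021-05-01', None]))
diff = days_since_month_day(dates, pd.Timestamp('2023-05-15 10:30'))
print(diff)
```

Because the anchor year is a leap year, a difference that spans the end of February comes out one day larger than it would in a non-leap year, which may or may not matter for a 14-day threshold.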

score 0 · Answer 2 · answered May 15 '23 at 08:18

Well, you cannot entirely get away from loops in pandas: even in the answer given by @jezrael, the looping is simply abstracted away by pandas' built-in methods. A more elaborate approach is to use pandas.DataFrame.apply and move all of your code into a function, something like this.

Using your exact code: first, I noticed you are using another dataframe (esgReport2021), so you might want to merge or join the two reports to get the columns into the same dataframe. In the example below, though, I have kept your approach as the primary one.

import datetime

import pandas as pd

def perform_inspection(row, today, esgReport2021):
    # Trace back: look up last year's date by this row's index label
    old_date = esgReport2021.at[row.name, 'date_2021']
    if pd.isna(row['date_2022']) and pd.notna(old_date):
        # Move last year's month and day into the current year
        # (raises ValueError for Feb 29 when the current year is not a leap year)
        old_modified_date = datetime.date(today.year, old_date.month, old_date.day)
        # Get the difference in days
        days = (today - old_modified_date).days
        # Flag rows that are more than 14 days past that date
        if days > 14:
            row['inspection'] = 'check'
    return row

today = pd.Timestamp.today().date()
df["date_2022"] = pd.NaT    # Assuming this is the 0th day of your report processing.
df["inspection"] = None     # Placeholder for rows that are not flagged
df = df.apply(perform_inspection, axis=1, args=(today, esgReport2021))

The apply method takes care of one or more row-level operations by passing the row itself as the first argument, as you can see in the function's definition.
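
As for merging or joining the two reports into one dataframe, a rough sketch could look like the following. The frame names `report_2022` and `report_2021`, their contents and the fixed `today` are made up for illustration, and the sketch assumes both reports share the same index labels. Once the columns sit side by side, the vectorized mask from the other answer can be applied directly instead of calling apply row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two reports, aligned on a shared index.
report_2022 = pd.DataFrame(
    {'date_2022': pd.to_datetime([None, None, '2022-03-05'])},
    index=['a', 'b', 'c'],
)
report_2021 = pd.DataFrame(
    {'date_2021': pd.to_datetime(['2021-04-01', '2021-05-10', '2021-03-01'])},
    index=['a', 'b', 'c'],
)

# Bring both date columns into one frame; join aligns rows by index label.
df = report_2022.join(report_2021, how='left')

today = pd.Timestamp('2023-05-15')  # fixed date so the result is reproducible
d1 = pd.to_datetime(today.strftime('%m-%d'), format='%m-%d')
d2 = pd.to_datetime(df['date_2021'].dt.strftime('%m-%d'), format='%m-%d')

# 2022 date missing AND more than 14 days past last year's month-day.
mask = df['date_2022'].isna() & ((d1 - d2).dt.days > 14)
df['inspection'] = np.where(mask, 'check', None)
print(df)
```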
