0

I have a pandas dataframe where the column ARRIVAL_DELAY is filled with different values (float64). I want to insert a new column COMPENSATION which will be filled depending on the values from ARRIVAL_DELAY.

I want it to look like this:

ARRIVAL_DELAY      COMPENSATION
      10                0
      25                0
      43                50
      61                250

I came up with the code below but it just seems that is not working at all (it is taking literally hours on Jupyter notebooks and it is not even completing anything). Also, no errors nor warnings are displayed:

fdf.insert(13, 'COMPENSATION', 0)

compensacion = [0, 50, 100, 250, 500, 1000]

for row in fdf['ARRIVAL_DELAY']:
    if row > 0 and row <= 15 : fdf['COMPENSATION'].add(compensacion[0])
    
    elif row > 15 and row <= 30 : fdf['COMPENSATION'].add(compensacion[1])
    
    elif row > 30 and row <= 60 : fdf['COMPENSATION'].add(compensacion[2])
    
    elif row > 60 and row <= 120 : fdf['COMPENSATION'].add(compensacion[3])
    
    elif row > 120 and row <= 180 : fdf['COMPENSATION'].add(compensacion[4])
    
    else :fdf['COMPENSATION'].add(compensacion[5])
          
fdf.head(10)

I do not understand what's wrong, any ideas?

Finally, I am kind of new to Python so if anyone has an improvement idea, it will be more than welcomed

Thank you!

imark
  • 3
  • 2

2 Answers2

0

This can be accomplished using np.select which is optimized as well as makes your code look a little neater and readable

import pandas as pd
import numpy as np


df = pd.DataFrame({
    'ARRIVAL_DELAY' : [10, 25, 30, 61]
})
condition_list = [
    df['ARRIVAL_DELAY'].between(1, 15),
    df['ARRIVAL_DELAY'].between(16, 30),
    df['ARRIVAL_DELAY'].between(31, 60),
    df['ARRIVAL_DELAY'].between(61, 120),
    df['ARRIVAL_DELAY'].between(121, 180)
]

choice_list = [0, 50, 100, 250, 500]
df['COMPENSATION'] = np.select(condition_list, choice_list, 1000)
df
ArchAngelPwn
  • 2,891
  • 1
  • 4
  • 17
0

Try to avoid iterating over the rows of your pandas dataframe. In fact, do it only as a last resort. Pandas and numpy provide many vecotorized functions that are highly optimized. Iterating over large dataframes could be excruciatingly slow. (Read this to understand more.)

numpy.select is an excellent option to solve your problem as provided by @ArchAngelPwn. An alternative is to use the pandas cut() which can bin continuous values to discrete intervals, and also is very efficient.

df = pd.DataFrame([10, 25, 43, 61], columns=['ARRIVAL_DELAY'])
df['COMPENSATION'] = pd.cut(df.ARRIVAL_DELAY, 
                           [0, 15, 30, 60, 120, 180, np.inf], 
                           labels=[0, 50, 100, 250, 500, 1000])
print(df)

   ARRIVAL_DELAY COMPENSATION
0             10            0
1             25           50
2             43          100
3             61          250
viggnah
  • 1,709
  • 1
  • 3
  • 12