
I'm trying to build a machine learning algorithm for my job. The data I'm using for training and testing has 17k rows and 20 columns. I've tried adding a new column based on two other columns, but the for loop I've written is too slow (it takes about 3 seconds to execute):

for i in range(0, len(model_olculeri)):
    if (model_olculeri["Bel"][i] != 0) and (model_olculeri["Basen"][i] != 0):
        sum_column = (model_olculeri["Bel"][i]) / (model_olculeri["Basen"][i])
        model_olculeri["Waist to Hip Ratio"][i] = sum_column

I've read articles about using pandas and NumPy vectorization instead of for loops on pandas DataFrames, and it seems to be much faster and more efficient. How can I implement this kind of vectorization for my for loop? Thanks a lot.

Samet
  • Yes, looping over each row is generally slow, especially if you have an operation (in this case, division) that you want to apply on an entire column. – Joshua Voskamp Oct 25 '21 at 13:48
  • https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing – Riley Oct 25 '21 at 13:52

2 Answers


Create a boolean mask and use it for filtering:

m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)
model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri.loc[m, "Bel"] / model_olculeri.loc[m,"Basen"]

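Alternative (pandas aligns the right-hand side on the index, so only the rows selected by the mask are assigned):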

model_olculeri.loc[m,"Waist to Hip Ratio"] = model_olculeri["Bel"] / model_olculeri["Basen"]

Or set the new values with numpy.where:

model_olculeri["Waist to Hip Ratio"] = np.where(m, model_olculeri["Bel"] / model_olculeri["Basen"], np.nan)
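
For a quick sanity check, here is a minimal, self-contained sketch that runs the mask-based and numpy.where variants on a tiny made-up DataFrame. The sample values are hypothetical; only the column names come from the question. Both variants should leave NaN wherever Bel or Basen is zero.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
model_olculeri = pd.DataFrame(
    {"Bel": [60.0, 0.0, 70.0, 80.0], "Basen": [90.0, 95.0, 0.0, 100.0]}
)

# True only where both measurements are non-zero.
m = (model_olculeri["Bel"] != 0) & (model_olculeri["Basen"] != 0)

# Variant 1: assign through the mask; unselected rows stay NaN.
by_loc = model_olculeri.copy()
by_loc.loc[m, "Waist to Hip Ratio"] = by_loc.loc[m, "Bel"] / by_loc.loc[m, "Basen"]

# Variant 2: compute the ratio everywhere, then keep it only where the mask holds.
by_where = model_olculeri.copy()
by_where["Waist to Hip Ratio"] = np.where(
    m, by_where["Bel"] / by_where["Basen"], np.nan
)

print(by_loc)
print(by_where)
```

Both approaches operate on whole columns at once, which is where the speed-up over the row-by-row loop comes from.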
jezrael

Chained solution using query and pipe (note that this returns a new DataFrame containing only the rows where both columns are non-zero):

model_olculeri.query("Bel != 0 & Basen != 0").pipe(lambda x: x.assign(**{"Waist to Hip Ratio": x.Bel / x.Basen}))
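
As a rough illustration of how the chained approach behaves, the sketch below runs it on a small made-up DataFrame (the values are hypothetical; the column names are from the question). Because the new column name contains spaces, it is passed to assign through a dictionary. The last step is one possible way to bring the result back into the original frame, assuming you want to keep all rows.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
model_olculeri = pd.DataFrame(
    {"Bel": [60.0, 0.0, 70.0, 80.0], "Basen": [90.0, 95.0, 0.0, 100.0]}
)

# query drops rows with a zero in either column; assign adds the ratio.
filtered = (
    model_olculeri
    .query("Bel != 0 and Basen != 0")
    .assign(**{"Waist to Hip Ratio": lambda x: x["Bel"] / x["Basen"]})
)
print(filtered)

# Optional: copy the ratio back into the full frame. Index alignment
# leaves NaN in the rows that query filtered out.
model_olculeri["Waist to Hip Ratio"] = filtered["Waist to Hip Ratio"]
print(model_olculeri)
```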