2

I am working on a huge dataframe and trying to create a new column, based on a condition in another column. Right now, I have a big while-loop and this calculation takes too much time, is there an easier way to do it?

With lambda for example?:

def promo(dataframe, a):  
    i=0
    while i < len(dataframe)-1:
        i=i+1
        if dataframe.iloc[i-1,5] >= a:
            dataframe.iloc[i-1,6] = 1
        else:
            dataframe.iloc[i-1,6] = 0

    return dataframe
SiHa
  • 7,830
  • 13
  • 34
  • 43
PV8
  • 5,799
  • 7
  • 43
  • 87
  • 1
    it's easier to understand if you provide some explanation of **trying to create a new column, based on a condition in another column** – Sociopath Sep 11 '18 at 06:56
  • If you're using loops with dataframe rows, you're usually doing something wrong. Please show example input and output – OneCricketeer Sep 11 '18 at 06:58
  • 1
    @cricket_007 particularly *indexing* for-loops. If you have to loop over rows, then you should at least use `itertuples`. – juanpa.arrivillaga Sep 11 '18 at 07:02

1 Answers1

2

Don't use loops in pandas, they are slow compared to a vectorized solution - convert boolean mask to integers by astype True, False are converted to 1, 0:

dataframe = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':list('aaabbb'),
                   'F':[5,3,6,9,2,4],
                   'G':[5,3,6,9,2,4]
})

a = 5
dataframe['new'] = (dataframe.iloc[:,5] >= a).astype(int)
print (dataframe)
   A  B  C  D  E  F  G  new
0  a  4  7  1  a  5  5    1
1  b  5  8  3  a  3  3    0
2  c  4  9  5  a  6  6    1
3  d  5  4  7  b  9  9    1
4  e  5  2  1  b  2  2    0
5  f  4  3  0  b  4  4    0

If you want to overwrite the 7th column:

a = 5
dataframe.iloc[:,6] = (dataframe.iloc[:,5] >= a).astype(int)
print (dataframe)
   A  B  C  D  E  F  G
0  a  4  7  1  a  5  1
1  b  5  8  3  a  3  0
2  c  4  9  5  a  6  1
3  d  5  4  7  b  9  1
4  e  5  2  1  b  2  0
5  f  4  3  0  b  4  0
M.T
  • 4,917
  • 4
  • 33
  • 52
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • @M.T - Thank you, +3 – jezrael Sep 11 '18 at 08:08
  • I do not understand why you are using iloc[]. In this case I would use: df.loc[:, 5] or df[5]. Can you explain me if there is any difference in performance? – Antonio Andrés Sep 11 '18 at 08:14
  • @AntonioAndrés - It depends if want seelct `6th column` like in my solution, then need `dataframe.iloc[:,5]`. But if default columns names by RangeIndex then is possible use `df[5]` or `df.loc[:, 5]` – jezrael Sep 11 '18 at 08:16
  • Ok, .iloc[] is more correct because the column notation is: 1, 2, 3. If the column notation will be: 'A', 'B', 'C'. the answer will use .loc[]. That's right? – Antonio Andrés Sep 11 '18 at 08:18
  • @AntonioAndrés - It depends what need - if want select by label or select by positions, check [this](https://stackoverflow.com/q/31593201) for difference. – jezrael Sep 11 '18 at 08:20
  • 1
    Ok, I understand you, iloc[] is for position also in columns not only in rows. Thanks! – Antonio Andrés Sep 11 '18 at 08:21
  • Is there any oppurtunity if i do not want to have 1 or 0, instead I need another number e.g. 2? – PV8 Sep 12 '18 at 06:57
  • 1
    @PV8 - Then use `dataframe['new'] = np.where(dataframe.iloc[:,5] >= a, 1, 2)` – jezrael Sep 12 '18 at 06:59
  • thx, does this also work for multiple conditions, if i want to check the column with a and b, and assign another value? – PV8 Sep 13 '18 at 09:43
  • @PV8 - Yes, then use `dataframe['new'] = np.where((dataframe.iloc[:,5] >= a) & (dataframe.iloc[:,5] >= b), 1, 2)` – jezrael Sep 13 '18 at 10:12
  • @PV8 - sure, change `dataframe.iloc[:,5]` to `dataframe['colname']` – jezrael Oct 01 '18 at 11:40