New column in dataframe based on location of values in another column

Question

I am trying to create a new column 'ratioA' in a dataframe df whereby the values are related to a column A:

For a given row, df['ratioA'] is equal to the ratio between df['A'] in that row and the next row.

I iterated over the index column as a reference, but I am not sure why all the values appear as NaN. Technically, only the last row should be NaN.

import numpy as np
import pandas as pd

series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})

df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
df = df.reset_index()

for i in df['index']:
    df['ratioA'] = df['A'][df['index']==i]/df['A'][df['index']==i+1]

print (df)

The output is:

   index  A  B  ratioA
0      0  1  2     NaN
1      1  3  4     NaN
2      2  5  6     NaN
3      3  7  8     NaN

The desired output should be:

   index  A  B  ratioA
0      0  1  2     0.33
1      1  3  4     0.60
2      2  5  6     0.71
3      3  7  8     NaN

jezrael · Accepted Answer · 2017-07-16T18:28:31.763


You can use a vectorized solution: divide column A by a shifted copy of itself with div:

print (df['A'].shift(-1))
0    3.0
1    5.0
2    7.0
3    NaN
Name: A, dtype: float64

df['ratioA'] = df['A'].div(df['A'].shift(-1))
print (df)
   index  A  B    ratioA
0      0  1  2  0.333333
1      1  3  4  0.600000
2      2  5  6  0.714286
3      3  7  8       NaN
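As for why the loop in the question produces only NaN, here is a minimal sketch, assuming the same four-row frame (the names num, den, aligned and by_position are illustrative only). Each boolean selection keeps its original row label, and pandas aligns on labels when dividing, so the labels i and i+1 never match. The loop also overwrites the whole 'ratioA' column on every pass.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8]}).reset_index()

num = df['A'][df['index'] == 0]   # one-element Series carrying row label 0
den = df['A'][df['index'] == 1]   # one-element Series carrying row label 1

# Division aligns on row labels; no label is present in both operands,
# so every entry of the result is NaN.
aligned = num / den
print(aligned)

# Dropping the labels with .to_numpy() divides by position instead.
by_position = num.to_numpy() / den.to_numpy()
print(by_position)
```

Shifting the column, as above, sidesteps the alignment issue because the shifted Series keeps the same labels as the original.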

Loops in pandas are very slow, so it is best to avoid them (Jeff, a pandas developer, explains it better). For comparison, a loop solution:

for i, row in df.iterrows():
    if i != df.index[-1]:
        df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A'] 
print (df)
   index  A  B    ratioA
0      0  1  2  0.333333
1      1  3  4  0.600000
2      2  5  6  0.714286
3      3  7  8       NaN

Timings:

series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})

df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
# repeat the 4 rows 1000 times, then add the 'index' column
df = pd.concat([df]*1000).reset_index(drop=True)
df = df.reset_index()
# [4000 rows x 3 columns]


In [49]: %timeit df['ratioA1'] = df['A'].div(df['A'].shift(-1))
1000 loops, best of 3: 431 µs per loop

In [50]: %%timeit 
    ...: for i, row in df.iterrows():
    ...:     if i != df.index[-1]:
    ...:         df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A']
    ...: 
1 loop, best of 3: 2.15 s per loop
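Outside IPython, a similar comparison can be sketched with the standard-library timeit module. This is only an illustration: it assumes a smaller 200-row frame so it runs quickly, and the helper names vectorized and looped are hypothetical, not part of pandas. Absolute timings will differ by machine and pandas version.

```python
import timeit
import pandas as pd

# Smaller frame than in the timings above (an assumption, to keep the run short)
df = pd.DataFrame({'A': [1, 3, 5, 7] * 50, 'B': [2, 4, 6, 8] * 50})

def vectorized(frame):
    # Divide each value by the value in the next row
    return frame['A'].div(frame['A'].shift(-1))

def looped(frame):
    # Same computation row by row; works on a copy so the input is untouched
    out = frame.copy()
    for i, row in out.iterrows():
        if i != out.index[-1]:
            out.loc[i, 'ratioA'] = out.loc[i, 'A'] / out.loc[i + 1, 'A']
    return out['ratioA']

t_vec = timeit.timeit(lambda: vectorized(df), number=5)
t_loop = timeit.timeit(lambda: looped(df), number=5)
print('vectorized:', t_vec, 'seconds for 5 runs')
print('loop:', t_loop, 'seconds for 5 runs')
```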

edited Jul 16 '17 at 18:28

answered Jul 16 '17 at 18:13

jezrael


What if, instead of the 'index' column, there was a 'time' column in seconds, so I can't use '.loc[ ]'? Is there a way of referencing the 'time' column, something like df['time' == i]['A'] / df['time' == i+1]? – user44840 Jul 16 '17 at 19:22

I think you then need `df = df.set_index('time')` and `df['ratioA'] = df['A'].div(df['A'].shift(-1))`. – jezrael Jul 16 '17 at 19:28
Now, if we take loc[i]/loc[i^2] (for i=1,2,3...) but it only goes up to i=1, what can we do to make the others NaN? – user44840 Jul 16 '17 at 19:35
`loc[i]` means select the rows with index `0,1,2,3`; `loc[i^2]` then means select the rows with index `0,1,4,9`. – jezrael Jul 16 '17 at 19:38
And `df.loc[i,'A']` means select the value in column `A` at index `i`. – jezrael Jul 16 '17 at 19:38
Also, the last value is problematic in the answer: `df.loc[i+1, 'A']` fails there because the index value `i+1` does not exist. So I added the condition `if i != df.index[-1]` to skip the last row in the loop. – jezrael Jul 16 '17 at 19:43
If i = [1,2,3,...,20], then for loc[i]/loc[i^2] the iteration only goes up to i=4, but this makes the entire code unparsable. Is there a way of taking i only up to 4 and making the rest NaN values? – user44840 Jul 16 '17 at 20:27
If I understand correctly, you need `for i, row in df.iloc[:4].iterrows(): df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i**2, 'A']` – jezrael Jul 16 '17 at 20:32
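The two follow-ups in the comments can be sketched without a loop. This is only an illustration under stated assumptions: the 'time' values, the 20-row frame, and the names df_time, df_sq, squares and denominator are all made up for the example. Note that squaring is written `i**2` in Python (`^` is bitwise XOR). Here reindex is used in place of the loop so that missing squared labels become NaN automatically.

```python
import numpy as np
import pandas as pd

# Case 1: a hypothetical 'time' column (in seconds) used as the index
df_time = pd.DataFrame({'time': [10, 20, 30, 40], 'A': [1, 3, 5, 7]})
df_time = df_time.set_index('time')
df_time['ratioA'] = df_time['A'].div(df_time['A'].shift(-1))
print(df_time)

# Case 2: divide row i by row i**2, leaving NaN where label i**2 does not exist
df_sq = pd.DataFrame({'A': np.arange(1, 21)})   # 20 rows labelled 0..19
squares = df_sq.index.to_numpy() ** 2           # 0, 1, 4, 9, 16, 25, ...
denominator = df_sq['A'].reindex(squares)       # NaN for labels not in the index
# Divide by position so the differing labels do not trigger alignment
df_sq['ratioA'] = df_sq['A'].to_numpy() / denominator.to_numpy()
print(df_sq)
```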
