Iterative comparison with pandas

Question

I don't know to approach this issue. I have a data frame that looks like this

cuenta_bancaria nombre_empresa  perfil_cobranza  usuario_id  usuario_web 
5545              a              123              500199         5012
5551              a              123              500199         3321
5551              a               55              500199         5541
5551              b               55              500199         5246

What I need to do is to iterate between each row per usuario_id and check if there's a difference between each row, and create a new data set with the row changed and the usuario_web in charge of this change, to generate a data frame that looks like this:

usuario_id     cambio           usuario_web
 500199       cuenta_bancaria    3321
 500199       perfil_cobranza    5541
 500199       nombre_empresa     5246

Is there any way to do this? I'm working with pandas on python and this dataset could be a little big, let's say around 10000 rows, sorted by usuario_id.

Thanks for any advice.

`check if there's a difference between each row` with respect to what columns? That's important. — cs95, Dec 18 '17 at 16:13
Forgot to add that. I need to compare, in this case, nombre_empresa, perfil_cobranza and cuenta_bancaria. Let's say between row 1 and row 2 the field cuenta_bancaria has changed, so I need to put that on the new data frame. Then compare again the row 2 and row 3 and check the difference, in this case on perfil_cobranza, and put that on the new data frame. — Jesus Rincon, Dec 18 '17 at 16:20
check this out https://stackoverflow.com/questions/40348541/pandas-diff-with-string/40348884#40348884 — louis_guitton, Dec 18 '17 at 16:35

cs95 · Accepted Answer · 2017-12-18T16:39:41.540

1

Compare adjacent rows with ne + shift, obtain a mask, and use this to

index into df to get the required rows
index into df.columns to get the required columns which change

c = df.columns.intersection(
        ['nombre_empresa', 'perfil_cobranza', 'cuenta_bancaria']
)

i = df[c].ne(df[c].shift())
j = i.sum(1).eq(1)

df = df.loc[j, ['usuario_id', 'usuario_web']]
df.insert(1, 'cambio', c[i[j].values.argmax(1)])

df

   usuario_id           cambio  usuario_web
1      500199  cuenta_bancaria         3321
2      500199  perfil_cobranza         5541
3      500199   nombre_empresa         5246

edited Dec 18 '17 at 16:39

answered Dec 18 '17 at 16:24

cs95

379,657
97
704
746

I'm a bit lost here. Should I sort this by usuario_id in order for this to work? – Jesus Rincon Dec 19 '17 at 19:14
@JesusRincon I don't think you need to, although I can't say because in this sample, only one ID is provided. Is there something wrong with the result? If yes, I'd encourage you to open a question with a [mcve] of the problem. I'll be able to better understand your issue there. – cs95 Dec 19 '17 at 19:16
I'm a bit new here, is there a way I can share you a dataframe to see how I'm getting the information? – Jesus Rincon Dec 19 '17 at 19:26
@JesusRincon Yup, through a question :-) Or via email you can find in my profile, but I don't recommend it, because I don't have the bandwidth to respond to emails right now. Your best bet is to open a new question. – cs95 Dec 19 '17 at 19:28

score 1 · Answer 2 · edited Dec 21 '19 at 22:13

There are a couple ways to iterate over a dataframe:

for index, row in df.iterrows():
    #blah blah blah

but since you're wanting to reference the prior row, I think the easiest will be to iterate by position:

df2 = pd.DataFrame()
for i in range(1, np.shape(df)[0]):
    current = df.iloc[i]
    last = df.iloc[i-1]
    newrow = {'usario_id' = current['usario_id'], 'usario_web'= current['usario_web']}
    if current['cuenta_bancaria'] != last['cuenta_bancaria']:
        newrow['cambio'] = 'cuenta_bancaria'
        df2 = df2.append(newrow, ignore_index = False)
    elif current['nombre_empresa'] != last['nombre_empresa']:
        newrow['cambio'] = 'nombre_empresa'
        df2 = df2.append(newrow, ignore_index = False)
    elif current['perfil_cobranza'] != last['perfil_cobranza']:
        newrow['cambio'] = 'perfil_cobranza'
        df2 = df2.append(newrow, ignore_index = False)

Worked for me too. Thanks – Jesus Rincon Dec 18 '17 at 19:01 — Jesus Rincon, Dec 18 '17 at 19:01

Iterative comparison with pandas

2 Answers2