0

I don't know to approach this issue. I have a data frame that looks like this

cuenta_bancaria nombre_empresa  perfil_cobranza  usuario_id  usuario_web 
5545              a              123              500199         5012
5551              a              123              500199         3321
5551              a               55              500199         5541
5551              b               55              500199         5246

What I need to do is to iterate between each row per usuario_id and check if there's a difference between each row, and create a new data set with the row changed and the usuario_web in charge of this change, to generate a data frame that looks like this:

usuario_id     cambio           usuario_web
 500199       cuenta_bancaria    3321
 500199       perfil_cobranza    5541
 500199       nombre_empresa     5246

Is there any way to do this? I'm working with pandas on python and this dataset could be a little big, let's say around 10000 rows, sorted by usuario_id.

Thanks for any advice.

cs95
  • 379,657
  • 97
  • 704
  • 746
Jesus Rincon
  • 136
  • 9
  • 1
    `check if there's a difference between each row` with respect to what columns? That's important. – cs95 Dec 18 '17 at 16:13
  • Forgot to add that. I need to compare, in this case, nombre_empresa, perfil_cobranza and cuenta_bancaria. Let's say between row 1 and row 2 the field cuenta_bancaria has changed, so I need to put that on the new data frame. Then compare again the row 2 and row 3 and check the difference, in this case on perfil_cobranza, and put that on the new data frame. – Jesus Rincon Dec 18 '17 at 16:20
  • What if more than one columns change? – cs95 Dec 18 '17 at 16:28
  • That won't happen. The changes are registered per column – Jesus Rincon Dec 18 '17 at 16:30
  • check this out https://stackoverflow.com/questions/40348541/pandas-diff-with-string/40348884#40348884 – louis_guitton Dec 18 '17 at 16:35

2 Answers2

1

Compare adjacent rows with ne + shift, obtain a mask, and use this to

  • index into df to get the required rows
  • index into df.columns to get the required columns which change
c = df.columns.intersection(
        ['nombre_empresa', 'perfil_cobranza', 'cuenta_bancaria']
)

i = df[c].ne(df[c].shift())
j = i.sum(1).eq(1)
df = df.loc[j, ['usuario_id', 'usuario_web']]
df.insert(1, 'cambio', c[i[j].values.argmax(1)])

df

   usuario_id           cambio  usuario_web
1      500199  cuenta_bancaria         3321
2      500199  perfil_cobranza         5541
3      500199   nombre_empresa         5246
cs95
  • 379,657
  • 97
  • 704
  • 746
  • I'm a bit lost here. Should I sort this by usuario_id in order for this to work? – Jesus Rincon Dec 19 '17 at 19:14
  • @JesusRincon I don't think you need to, although I can't say because in this sample, only one ID is provided. Is there something wrong with the result? If yes, I'd encourage you to open a question with a [mcve] of the problem. I'll be able to better understand your issue there. – cs95 Dec 19 '17 at 19:16
  • I'm a bit new here, is there a way I can share you a dataframe to see how I'm getting the information? – Jesus Rincon Dec 19 '17 at 19:26
  • @JesusRincon Yup, through a question :-) Or via email you can find in my profile, but I don't recommend it, because I don't have the bandwidth to respond to emails right now. Your best bet is to open a new question. – cs95 Dec 19 '17 at 19:28
1

There are a couple ways to iterate over a dataframe:

for index, row in df.iterrows():
    #blah blah blah

but since you're wanting to reference the prior row, I think the easiest will be to iterate by position:

df2 = pd.DataFrame()
for i in range(1, np.shape(df)[0]):
    current = df.iloc[i]
    last = df.iloc[i-1]
    newrow = {'usario_id' = current['usario_id'], 'usario_web'= current['usario_web']}
    if current['cuenta_bancaria'] != last['cuenta_bancaria']:
        newrow['cambio'] = 'cuenta_bancaria'
        df2 = df2.append(newrow, ignore_index = False)
    elif current['nombre_empresa'] != last['nombre_empresa']:
        newrow['cambio'] = 'nombre_empresa'
        df2 = df2.append(newrow, ignore_index = False)
    elif current['perfil_cobranza'] != last['perfil_cobranza']:
        newrow['cambio'] = 'perfil_cobranza'
        df2 = df2.append(newrow, ignore_index = False)
marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Jacob H
  • 345
  • 1
  • 11