check on pandas dataframe

Question

I have a pandas dataframe composed by 3 columns.

  index  start  end     value
    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   54500   0
    7   54501   54501   -1
    8   54502   55259   0

I would like to implement a check on the difference between start and end of each row. In particular what I would like to do is:

if end row x - start row x  == 0 incorporate this row in the previous row

For example the 8th row

7   54501   54501   -1

has end - start = 0. I would like to modify the dataframe like this

  index  start  end     value
    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   54501   0
    7   54502   55259   0

and then since the 7th and the 8th row now have the same "value" it should become

    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   55259   0

EDITED

Please note that a particular case would be

  index  start  end     value
    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   54500   0
    7   54501   54501   -1
    8   54502   54502   0
    9   54503   55259   1

In this case there are 2 consecutive rows (8th and 9th) for which the difference between end and start values is 0. In this case the answer proposed gives an error since the index 7th was deleted previously. I solved this case using a while loop instead of a for loop, but I guess it is not the best thing to do.

For this case we should have

  index  start  end     value
    0       0   37647   0
    1   37648   37846   1
    2   37847   42874   0
    3   42875   43049   1
    4   43050   51352   0
    5   51353   51665   -1
    6   51666   54502   0
    7   54503   55259   1

Khris · Answer 1 · 2017-03-14T14:45:56.653

0

Using numpy where you can do it like this:

import numpy as np

inp = np.where(df.start == df.end)[0]
droplist = []
save = 0
j = 0
for i in range(len(inp)):
    if inp[i] > 0:
        if inp[i]-inp[i-1] == 1:
            j += 1
            save += 1
            df.loc[inp[i]-1-j,"end"] += save
        else:
            j = 0
            save = 0
            df.loc[inp[i]-1,"end"] += 1
        droplist.append(inp[i])
df = df.drop(droplist).reset_index(drop=True)

droplist = []
jnp = np.where(df.value == df.value.shift(-1))[0]
for jj in jnp:
    df.loc[jj,"end"] = df.loc[jj+1,"end"]
    droplist.append(jj+1)
df = df.drop(droplist).reset_index(drop=True)

There might be a more pythonic way without for-loops using numpy though.

EDIT: Fixed for consecutive rows.

edited Mar 14 '17 at 14:45

answered Mar 13 '17 at 12:11

Khris

3,132
3
34
54

Thanks for your solution, but if there are 2 consecutive rows (indices) for which the difference between end and start values is 1 there will be an error because the code won't find the index previously dropped.. – gabboshow Mar 13 '17 at 15:59
@gabboshow - can you update your question to reflect that situation so that we can better answer it? – elPastor Mar 13 '17 at 19:02
Hi Khris thanks! would be nice that if ii == 0 then the first row is merged to the second if the second if the second is OK (ie difference !=0) if also the second has 0 as difference between start and end then first and second row should be merged to the 3rd and so on.. – gabboshow Mar 14 '17 at 09:04
I ve added this to your code: if ii > 0: AltitudeChange.loc[ii-1,"End"] += 1 droplist.append(ii) else: AltitudeChange.loc[ii+1,"Start"] -= 1 droplist.append(ii) what do you think? – gabboshow Mar 14 '17 at 09:12
I'm sorry, the change I made did not solve the problem, I will work on it. Does the method in the other answer not work? – Khris Mar 14 '17 at 10:58
I fixed my code again but now it has become really ugly and bulky, I'm not happy with it. About your added code, looks fine to me, does it work? – Khris Mar 14 '17 at 11:12
the code in the other answer returns warning and I am not confident with them. The code I have added seems to do the job.. – gabboshow Mar 14 '17 at 14:40
ps what was the problem with the previous version of your code? – gabboshow Mar 14 '17 at 14:45

elPastor · Answer 2 · 2017-03-13T13:52:42.120

0

This could probably be cleaned up a bit, but should work.

Code:

# FIRST CHECK
df['end'][df['end'].shift(-1) == df['start'].shift(-1)] = df['end'].shift(-1)
df.drop_duplicates('end', inplace = True)

# SECOND CHECK
df['end'][df['value'].shift(-1) == df['value']] = df['end'].shift(-1)
df['value'][df['value'].shift(-1) == df['value']] = (df['value'] + df['value'].shift(-1)).fillna(0).astype(int)
df.drop_duplicates('end', inplace = True)

Output:

   start    end  value
0      0  37647      0
1  37648  37846      1
2  37847  42874      0
3  42875  43049      1
4  43050  51352      0
5  51353  51665     -1
6  51666  55259      0

edited Mar 13 '17 at 13:52

answered Mar 13 '17 at 12:32

elPastor

8,435
11
53
81

To not get the SettingWithCopyWarning you should use the `loc`/`iloc`-syntax everywhere you assign values to rows or a subset of the dataframe, e.g. when you assign to a subset from a boolean filter like you do. – Khris Mar 13 '17 at 12:35
Thanks @Khris. Updated answer to not throw warning and reduced lines. – elPastor Mar 13 '17 at 12:35

check on pandas dataframe

2 Answers2

Linked