
I'm struggling with a data-massaging process and thinking about the best way to find and remove non-changing entries in a dataframe.

I have signal-level data that comes from a mobile device:

measDate    measLatitude    measLongitude   measCellId  measNetTypeDetail   measOperatorCode    measSignalLevel
2019-06-05 00:22:10.791     27.676038   84.177025   14603   13  42902   -97 
2019-06-05 00:22:11.806     27.676038   84.177025   14603   13  42902   -97 
2019-06-05 00:22:14.179     27.676038   84.177025   14604   13  42902   -97 
2019-06-05 00:22:14.279     27.676038   84.177025   14604   13  42902   -97 
2019-06-05 00:22:16.657     27.676038   84.177025   14604   13  42902   -97 
2019-06-05 00:22:18.904     27.676038   84.177025   14615   13  42902   -96 
2019-06-05 00:22:21.276     27.676038   84.177025   14615   13  42902   -96 
2019-06-05 00:22:23.557     27.676038   84.177025   14614   13  42902   -95 
2019-06-05 00:22:24.796     27.676038   84.177025   14603   10  42902   -96 
2019-06-05 00:22:26.768     27.676038   84.177025   14603   10  42902   -96 
2019-06-05 00:22:27.787     27.676038   84.177025   14603   10  42902   -96 
2019-06-05 00:22:28.802     27.676038   84.177025   14603   10  42902   -96 
2019-06-05 00:22:31.803     27.676038   84.177025   14603   10  42902   -96 
2019-06-05 00:22:33.799     27.676038   84.177025   14603   10  42902   -96 

So basically the signal level plus some associated data and a timestamp.

What I need to do is to filter the data in the following way:

  • if the signal level does not change for the past n samples (e.g. n = 3 samples)
  • and the other values do not change either
  • filter out the excess rows.

So in the end I'd like to have at most n identical samples in a row.

Expected result with window N = 3:

measDate    measLatitude    measLongitude   measCellId  measNetTypeDetail   measOperatorCode    measSignalLevel
2019-06-05 00:22:10.791     27.676038   84.177025   14603   13  42902   -97 
2019-06-05 00:22:11.806     27.676038   84.177025   14603   13  42902   -97 
2019-06-05 00:22:14.179     27.676038   84.177025   14604   13  42902   -97 
2019-06-05 00:22:18.904     27.676038   84.177025   14615   13  42902   -96 
2019-06-05 00:22:21.276     27.676038   84.177025   14615   13  42902   -96 
2019-06-05 00:22:23.557     27.676038   84.177025   14614   13  42902   -95 
2019-06-05 00:22:24.796     27.676038   84.177025   14603   10  42902   -96 
2019-06-05 00:22:26.768     27.676038   84.177025   14603   10  42902   -96 
2019-06-05 00:22:27.787     27.676038   84.177025   14603   10  42902   -96
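
For reference, the sample above can be rebuilt as a dataframe like this (a sketch; the dtypes are my assumption: parsed dates, floats for the coordinates, ints elsewhere):

import pandas as pd
from io import StringIO

# Rebuild the sample data shown above
csv = """measDate,measLatitude,measLongitude,measCellId,measNetTypeDetail,measOperatorCode,measSignalLevel
2019-06-05 00:22:10.791,27.676038,84.177025,14603,13,42902,-97
2019-06-05 00:22:11.806,27.676038,84.177025,14603,13,42902,-97
2019-06-05 00:22:14.179,27.676038,84.177025,14604,13,42902,-97
2019-06-05 00:22:14.279,27.676038,84.177025,14604,13,42902,-97
2019-06-05 00:22:16.657,27.676038,84.177025,14604,13,42902,-97
2019-06-05 00:22:18.904,27.676038,84.177025,14615,13,42902,-96
2019-06-05 00:22:21.276,27.676038,84.177025,14615,13,42902,-96
2019-06-05 00:22:23.557,27.676038,84.177025,14614,13,42902,-95
2019-06-05 00:22:24.796,27.676038,84.177025,14603,10,42902,-96
2019-06-05 00:22:26.768,27.676038,84.177025,14603,10,42902,-96
2019-06-05 00:22:27.787,27.676038,84.177025,14603,10,42902,-96
2019-06-05 00:22:28.802,27.676038,84.177025,14603,10,42902,-96
2019-06-05 00:22:31.803,27.676038,84.177025,14603,10,42902,-96
2019-06-05 00:22:33.799,27.676038,84.177025,14603,10,42902,-96"""

df = pd.read_csv(StringIO(csv), parse_dates=['measDate'])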

I've come up with this code after some tests. I check whether any of the columns' values changed by counting unique values within a window per column (here window size = 3); then I sum across columns, and if any column changed I keep the row:

import numpy as np
import pandas as pd

# only the columns whose values should be compared
RadioSmall = RadioMeasAll.loc[:, ['measLatitude', 'measLongitude', 'measCellId',
                                  'measNetTypeDetail', 'measOperatorCode', 'measCid',
                                  'measLac', 'measSignalLevel', 'cellArfcn']].copy()

def f(x):
    # 1 if the window contains more than one distinct (non-NaN) value, else 0
    y = x[~np.isnan(x)]
    return 1 if len(np.unique(y)) > 1 else 0

# per column: did the value change within the rolling window of 3 samples?
a = RadioSmall.rolling(window=3, min_periods=1).apply(f, raw=True)
a['sum'] = a.sum(axis=1)          # number of columns that changed
b = pd.DataFrame(index=a.index)
b['filtering'] = a['sum']
df_filtered = b.query('filtering > 0')  # rows where at least one column changed

RadioMeasAll.join(df_filtered)
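
Instead of the join, which only attaches the flag column, a boolean mask would drop the unwanted rows directly (a sketch, reusing `a` from above; `RadioMeasKept` is just an illustrative name):

# keep only rows where at least one column changed within the window
RadioMeasKept = RadioMeasAll[a['sum'] > 0]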

Looks very ugly to me.

Thanks for any help!

morf
    Have a look at [How to make good reproducible pandas examples](https://stackoverflow.com/a/20159305/463796) - please post a subset of your actual data, not screenshots of data, so we can reproduce your problem. – w-m Jun 05 '19 at 09:28
  • Thanks @w-m for pointing that out. I've edited my post – morf Jun 05 '19 at 10:39

3 Answers

import numpy as np

cols_to_compare = df.columns.drop(['measDate'])  # columns whose values will be compared
window_size = 3

# True where the row differs from the previous row in any compared column
has_change = df[cols_to_compare].diff().any(axis=1)
# keep a row only if at least one of the last `window_size` rows changed;
# the incomplete leading windows are filled with True so the first rows are kept
df = df[has_change.rolling(window_size).apply(np.any, raw=True).fillna(1, downcast="bool")]
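
In words: a row is dropped only when none of the last `window_size` rows changed, i.e. when it is at least the fourth identical row of a run. On the sample data this should keep the first eleven rows and drop the last three duplicates.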
GZ0

I believe I understand your goal now.

df = df.reset_index(drop=False)  # I want to operate on integer indices
# `columns` is the list of columns to compare
a = df.index[(df[columns] == df[columns].shift()).all(axis=1)]

`a` is an array containing the indices of consecutive rows whose values in `columns` equal those of the previous row (it omits the FIRST occurrence of each run).

Now we should split it into subsets of consecutive indices (idea from this answer):

import numpy as np

duplicate_groups = np.split(a, np.where(np.diff(a) != 1)[0] + 1)

`duplicate_groups` now contains arrays of consecutive indices. You can now filter for runs of consecutive duplicates longer than N, let's say N = 5:

[arr for arr in duplicate_groups if len(arr) > 4]  # 4 because we omitted the FIRST occurrence

Try to run it on your data and see if it solves your problem.
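
To actually remove the excess rows, a sketch (assuming the question's N = 3, so each run keeps its first occurrence plus N - 1 duplicates; `df_filtered` is just an illustrative name):

N = 3
# indices beyond the first N - 1 duplicates of each run
long_runs = [arr[N - 1:] for arr in duplicate_groups if len(arr) >= N]
df_filtered = df.drop(index=np.concatenate(long_runs)) if long_runs else df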

dylan_fan
  • I understand columns is an array of columns to be checked. but what is id_ ? I've edited my post with my sample code and replicable set – morf Jun 05 '19 at 10:44
  • Sorry, `id_` was supposed to be `a` – dylan_fan Jun 05 '19 at 11:11
  • after you run first two lines, what does `a` contain ? – dylan_fan Jun 05 '19 at 11:38
  • a returns empty `[]` `columns = ['measLatitude','measLongitude','measCellId','measNetTypeDetail','measOperatorCode','measLac','measSignalLevel','cellArfcn'] df.reset_index(drop=False) a = df.index[(df[columns] == df[columns].shift()).all(axis=1)]` – morf Jun 05 '19 at 11:43
  • Are any of these columns floats? Change your `columns` variable so it only has `measSignalLevel` and see, then add more columns and check which one makes `a` empty – dylan_fan Jun 05 '19 at 11:47

I would use a temporary dataframe sharing the same index as the original one to compute the groups of consecutive values, their size, and the rank of each row within its group.

Then I would only keep in the original dataframe the rows having a rank below the window:

tmp = pd.DataFrame(index=df.index)                     # tmp dataframe with same index
# True where the row equals the previous row (ignoring the datetime column)
tmp['dup'] = (df[df.columns[1:]] == df[df.columns[1:]].shift()).all(axis=1)
# define groups of consecutive identical rows
tmp.loc[~tmp['dup'], 'change'] = 1
tmp['group'] = tmp['change'].cumsum()
tmp['group'].fillna(method='ffill', inplace=True)

# compute ranks in groups
tmp['rank'] = tmp.groupby('group').cumcount()

# extract a filtered dataframe: keep at most 3 rows per group
filtered = df.loc[tmp['rank'] < 3]

It gives as expected:

                  measDate  measLatitude  measLongitude  measCellId  measNetTypeDetail  measOperatorCode  measSignalLevel
0  2019-06-05 00:22:10.791     27.676038      84.177025       14603                 13             42902              -97
1  2019-06-05 00:22:11.806     27.676038      84.177025       14603                 13             42902              -97
2  2019-06-05 00:22:14.179     27.676038      84.177025       14604                 13             42902              -97
3  2019-06-05 00:22:14.279     27.676038      84.177025       14604                 13             42902              -97
4  2019-06-05 00:22:16.657     27.676038      84.177025       14604                 13             42902              -97
5  2019-06-05 00:22:18.904     27.676038      84.177025       14615                 13             42902              -96
6  2019-06-05 00:22:21.276     27.676038      84.177025       14615                 13             42902              -96
7  2019-06-05 00:22:23.557     27.676038      84.177025       14614                 13             42902              -95
8  2019-06-05 00:22:24.796     27.676038      84.177025       14603                 10             42902              -96
9  2019-06-05 00:22:26.768     27.676038      84.177025       14603                 10             42902              -96
10 2019-06-05 00:22:27.787     27.676038      84.177025       14603                 10             42902              -96

Note: if in the original dataframe the datetime were the index instead of a column, the definition of the duplicates would change to:

tmp['dup'] = (df == df.shift()).all(axis=1)  # row equals previous row
Serge Ballesta