
Suppose I have the following DataFrame df:

import pandas as pd

df = pd.DataFrame({
    "a": [8,8,0,8,8,8,8,8,8,8,4,1,4,4,4,4,4,4,4,4,4,4,7,7,4,4,4,4,4,4,4,4,5,5,5,5,5,5,1,1,5,5,5,5,5,5,1,5,1,5,5,5,5]})

I want to normalize my data: if a value occurs fewer than 3 times in a row, replace it with the value of the neighboring run of consecutive values.

Expected result:
df = pd.DataFrame({
    "a": [8,8,8,8,8,8,8,8,8,8,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5]})

Currently I make this work by iterating manually, but I think pandas may have a dedicated function for it.

rachael
  • There is no special function for this. If you show your attempt, we may be able to optimize it for speed. A for loop can become very slow on a large dataset. – Rahul Vishwakarma Aug 08 '20 at 13:52
  • I think the answer here can help you :) https://stackoverflow.com/questions/27626542/counting-consecutive-positive-value-in-python-array – Active_Learner Aug 08 '20 at 15:02

2 Answers


This is a little tricky. Use diff() and cumsum() to label each run of consecutive equal values, and np.size to get the length of each run. Then use mask() to blank out the runs shorter than 3 and fill them in with ffill() and bfill().

import numpy as np
# s holds, for every row, the length of the run of equal values it belongs to
s = df.groupby((df['a'].diff() != 0).cumsum()).transform(np.size)
df['a'] = df[['a']].mask(s < 3).ffill().bfill()

#result
[8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 4., 4., 4., 4., 4.,
   4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 5., 5.,
   5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5., 5.,
   5., 5.]
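To see what each step contributes, here is a minimal sketch on a small made-up series. The data and the names small, run_id, run_len, and cleaned are assumptions for illustration only, and transform("size") is used here in place of np.size.

```python
import pandas as pd

# Hypothetical toy data: runs are 8,8,8 | 0 | 8,8,8 | 4 | 1 | 4,4,4
small = pd.Series([8, 8, 8, 0, 8, 8, 8, 4, 1, 4, 4, 4], name="a")

# A new run starts wherever the value differs from the previous one;
# the cumulative sum of those start flags gives every run its own id.
run_id = (small.diff() != 0).cumsum()

# Length of the run that each row belongs to.
run_len = small.groupby(run_id).transform("size")

# Blank out runs shorter than 3, fill from the left first, then from
# the right for anything still missing at the very start.
cleaned = small.mask(run_len < 3).ffill().bfill().astype(int)

print(run_id.tolist())
print(run_len.tolist())
print(cleaned.tolist())
```

Because ffill() runs before bfill(), a short run takes the value of the run on its left whenever there is one. That is why the isolated 4 at position 7 becomes 8 in this sketch, and why the result above starts with twelve 8s rather than ten. The final astype(int) is optional; it only undoes the float conversion that the temporary NaN values cause.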
Terry

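NumPy is useful here: the differences between neighboring elements identify runs of length 1 and 2, which are then overwritten with the neighboring values: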

import numpy as np
import pandas as pd

df = pd.DataFrame({"a" : [8,8,0,8,8,8,8,8,8,8,
                          4,1,4,4,4,4,4,4,4,4,
                          4,4,7,7,4,4,4,4,4,4,
                          4,4,5,5,5,5,5,5,1,1,
                          5,5,5,5,5,5,1,5,4,5,
                          5,5,5]})

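arr = df.values.reshape(-1)   # 1-D array of column "a"
sub = arr[1:]-arr[:-1]        # difference between neighboring elements
add2 = sub[1:]+sub[:-1]       # sum of two adjacent differences
add3 = sub[2:]+sub[:-2]       # sum of two differences that are two steps apart
# a jump that is cancelled by the very next difference marks a run of length 1
del2 = np.where((sub[1:]!=0) & (add2*sub[1:]==0))[0]+1
# a jump that is cancelled two differences later is treated as the start of a run of length 2
del3 = np.where((sub[2:]!=0) & (add3*sub[2:]==0))[0]+1
arr[del2] = arr[del2-1]       # single value: copy the left neighbor
arr[del3] = arr[del3-1]       # first value of the pair: copy the left neighbor
arr[del3+1] = arr[del3+2]     # second value of the pair: copy the right neighbor
df = pd.DataFrame({"a" : arr})
print(arr)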

'''
Output:
[8 8 8 8 8 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5
 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5]
'''
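The single-value part of this difference trick can be sketched on a tiny array. This is an illustration under assumptions: the toy data and the name fixed are made up, and only the del2 step is shown.

```python
import numpy as np

# Hypothetical toy input with one isolated value (the 0 at index 2)
arr = np.array([8, 8, 0, 8, 8, 4, 4, 4])

sub = arr[1:] - arr[:-1]      # differences: [0, -8, 8, 0, -4, 0, 0]
add2 = sub[1:] + sub[:-1]     # add2[i] = sub[i] + sub[i+1]

# sub[i+1] != 0 and sub[i] + sub[i+1] == 0 means the value jumps away
# and immediately jumps back, so arr[i+1] is a lone value.
del2 = np.where((sub[1:] != 0) & (add2 * sub[1:] == 0))[0] + 1

fixed = arr.copy()            # work on a copy so arr stays unchanged
fixed[del2] = fixed[del2 - 1] # overwrite the lone value with its left neighbor

print(del2.tolist())
print(fixed.tolist())
```

The condition only fires when the values on both sides of the short run are equal, so a short run sitting exactly at a transition between two different values (such as 8, 8, 1, 4, 4) is not detected this way. Also note that the sample data in this answer has a 4 at index 48 where the question has a 1.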
Rahul Vishwakarma