9

I have dataframe which should be filled by understanding rows understanding like we do in excel. If its continious integer it fill by next number itself.

Is there any function in python like this?

import pandas as pd
d = { 'year': [2019,2020,2019,2020,np.nan,np.nan], 'cat1': [1,2,3,4,np.nan,np.nan], 'cat2': ['c1','c1','c1','c2',np.nan,np.nan]}
df = pd.DataFrame(data=d)
df
    year    cat1    cat2
0   2019.0  1.0     c1
1   2020.0  2.0     c1
2   2019.0  3.0     c1
3   2020.0  4.0     c2
4   NaN     NaN     NaN
5   NaN     NaN     NaN

output required:

    year    cat1    cat2
0   2019.0  1.0     c1
1   2020.0  2.0     c1
2   2019.0  3.0     c1
3   2020.0  4.0     c2
4   2019.0  5.0     c2 #here can be ignored if it can't understand the earlier pattern
5   2020.0  6.0     c2 #here can be ignored if it can't understand the earlier pattern

I tried df.interpolate(method='krogh') #it fill 1,2,3,4,5,6 but incorrect others.

vishvas chauhan
  • 229
  • 1
  • 12
  • Did you try `df.interpolate(method='linear')`? If the values are evenly spaced, I would expect this to work. – Chuck Tucker Oct 08 '21 at 12:10
  • Off the top of my head, the only way I can think of to fill the year would be to use `.apply()` and a function to return 2019 if the index is even and 2020 if the index is odd. I am unaware of a generic method to detect and fill a pattern. – Chuck Tucker Oct 08 '21 at 12:18
  • I can setup a list based on a pattern and amend it rows but I can only do it one by one for each row. There should be something autofill method... .. – vish community Oct 08 '21 at 12:42
  • For cat2, could please give more information on the pattern ? Not enough info, eg would there be a c3 ? How would it look like. is it c1,c1,c1,c2,nan,nan,c3,c3,nan or ? – EBDS Oct 11 '21 at 00:42
  • For cat1, I see you already have the solution, so I won't provide new method. Yours seems good already. For year, I've provided a very straightforward 3 liner, generic code (value independent). – EBDS Oct 11 '21 at 00:42
  • Another point I would like to clarify is: while the data seems to have relationships between columns, the point that you are looking for excel drag means that our approach should not focus on the relationship ? Basically each column are handled on its own ? Thanks for confirming. – EBDS Oct 11 '21 at 01:02

5 Answers5

5

Here is my solution for the specific use case you mention -

The code for these helper functions for categorical_repeat, continous_interpolate and other is provided below in EXPLANATION > Approach section.

config = {'year':categorical_repeat,    #shortest repeating sequence
          'cat1':continous_interpolate, #curve fitting (linear)
          'cat2':other}                 #forward fill

print(df.agg(config))
     year  cat1 cat2
0  2019.0     1   c1
1  2020.0     2   c1
2  2019.0     3   c1
3  2020.0     4   c2
4  2019.0     5   c2
5  2020.0     6   c2

EXPLANATION:

As I understand, there is no direct way of handling all types of patterns in pandas as excel does. Excel involves linear interpolation for continuous sequences, but it involves other methods for other column patterns.

  1. Continous integer array -> linear interpolation
  2. Repeated cycles -> Smallest repeating sequence
  3. Alphabet (and similar) -> Tiling fixed sequence until the length of df
  4. Unrecognizable pattern -> Forward fill

Here is the dummy dataset that I attempt my approach on -

data = {'A': [2019, 2020, 2019, 2020, 2019, 2020],
        'B': [1, 2, 3, 4, 5, 6],
        'C': [6, 5, 4, 3, 2, 1],
        'D': ['C', 'D', 'E', 'F', 'G', 'H'],
        'E': ['A', 'B', 'C', 'A', 'B', 'C'],
        'F': [1,2,3,3,4,2]
       }

df = pd.DataFrame(data)
empty = pd.DataFrame(columns=df.columns, index=df.index)[:4]
df_new = df.append(empty).reset_index(drop=True)
print(df_new)
      A    B    C    D    E    F
0  2019    1    6    C    A    1
1  2020    2    5    D    B    2
2  2019    3    4    E    C    3
3  2020    4    3    F    A    3
4  2019    5    2    G    B    4
5  2020    6    1    H    C    2
6   NaN  NaN  NaN  NaN  NaN  NaN
7   NaN  NaN  NaN  NaN  NaN  NaN
8   NaN  NaN  NaN  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN  NaN

Approach:

Let's start with some helper functions -

import numpy as np
import scipy as sp
import pandas as pd


#Curve fitting (linear)
def f(x, m, c):
    return m*x+c     #Modify to extrapolate for exponential sequences etc.

#Interpolate continous linear
def continous_interpolate(s):
    clean = s.dropna()
    popt, pcov = sp.optimize.curve_fit(f, clean.index, clean)
    output = [round(i) for i in f(s.index, *popt)]  #Remove the round() for float values
    return pd.Series(output)

#Smallest Repeating sub-sequence
def pattern(inputv):
    '''
    https://stackoverflow.com/questions/6021274/finding-shortest-repeating-cycle-in-word
    '''
    pattern_end =0
    for j in range(pattern_end+1,len(inputv)):

        pattern_dex = j%(pattern_end+1)
        if(inputv[pattern_dex] != inputv[j]):

            pattern_end = j;
            continue

        if(j == len(inputv)-1):
            return inputv[0:pattern_end+1];
    return inputv;

#Categorical repeat imputation
def categorical_repeat(s):
    clean = s.dropna()
    cycle = pattern(clean)
    
    repetitions = (len(s)//len(cycle))+1
    output = np.tile(cycle, repetitions)[:len(s)]
    return pd.Series(output)

#continous sequence of alphabets
def alphabet(s):
    alp = 'abcdefghijklmnopqrstuvwxyz'
    alp2 = alp*((len(s)//len(alp))+1)
    
    start = s[0]
    idx = alp2.find(start.lower())
    output = alp2[idx:idx+len(s)]

    if start.isupper():
        output = output.upper()
    
    return pd.Series(list(output))

#If no pattern then just ffill
def other(s):
    return s.ffill()

Next, lets create a configuration based on what we want to solve and apply the methods required -

config = {'A':categorical_repeat,
          'B':continous_interpolate, 
          'C':continous_interpolate, 
          'D':alphabet,
          'E':categorical_repeat, 
          'F':other}

output_df = df_new.agg(config)
print(output_df)
      A   B  C  D  E  F
0  2019   1  6  C  A  1
1  2020   2  5  D  B  2
2  2019   3  4  E  C  3
3  2020   4  3  F  A  3
4  2019   5  2  G  B  4
5  2020   6  1  H  C  2
6  2019   7  0  I  A  2
7  2020   8 -1  J  B  2
8  2019   9 -2  K  C  2
9  2020  10 -3  L  A  2
Akshay Sehgal
  • 18,741
  • 3
  • 21
  • 51
2

I tested some stuff out and did some more research. It appears pandas does not currently offer the functionality you're looking for.

df['cat'].interpolate(method='linear') will only work if the first/last values are filled in already. You would have to manually assign df.loc[5, 'cat1'] = 6 in this example, then a linear interpolation would work.

Some Options:

  1. If the data is small enough, you can always export to Excel and use the fill there, then bring back into pandas.

  2. Analyze the patterns yourself and design your own fill methods. For example, to get the year, you can use df['year'] = df.index.to_series().apply(lambda x: 2019 if x % 2 == 0 else 2020).

There are other Stack Overflow questions very similar to this, and none that I saw have a generic answer.

Chuck Tucker
  • 259
  • 1
  • 8
  • Thanks for your testing and suggestion.. . I am looking for autofill method or def fun that can work for it. – vishvas chauhan Oct 08 '21 at 13:31
  • 1
    sorry for negative response from someone.. – vishvas chauhan Oct 08 '21 at 13:34
  • The autofill methods that are out there `ffill`, `bfill`, `fillna`, and `interpolate` don't have options that will give you exactly what you want for the entire dataframe. If there is something else out there that can intelligently fill NA values at the end of a dataframe, I'm not familiar with it after 5 years of working with pandas, and I was unable to find it in pandas documentation or on other SO questions. I still think you're going to need to write a custom solution for each column. Best of luck. – Chuck Tucker Oct 08 '21 at 13:52
  • I put bounty on this question.. can you please look for this question .. only to fix the loop.. https://stackoverflow.com/questions/69515980/for-loop-is-not-working-while-appending-rows – vishvas chauhan Oct 10 '21 at 14:35
  • i have added a solution which solves what you are looking for, and is scalable / modifiable to other cases. do check and play around with it. – Akshay Sehgal Oct 10 '21 at 17:29
  • Thanks Akshay, Its really helpful.. and I am testing and exploring it more. – vishvas chauhan Oct 13 '21 at 18:33
1

Below is my answer for the year. I understand that the cat1 is handled and cat2 can be ignored. One assumption I've made base on looking at the question is that the repeat pattern is consistent. If not, the factorize may not work.

The idea is to use factorise to extract the repeat pattern. Then form a list of the repeat pattern. The excel drag function is a cycle of the repeat pattern. So it's natural to use itertools cycle. (PS: this is first done in @jabellcu answer, so I don't want to take credit for it. If you think factorize + cycle is good, please check his answer.)

An advantage is that the codes is generic. You don't have to hardcode the values. You can turn it into a function and call it for whatever values there are in the dataframe.

import pandas as pd
d = { 'year': [2019,2020,2019,2020,np.nan,np.nan], 'cat1': [1,2,3,4,np.nan,np.nan], 'cat2': ['c1','c1','c1','c2',np.nan,np.nan]}
df = pd.DataFrame(data=d)
df

Enhanced Answer:

l = df['year'].factorize()[1].to_list()
c = cycle(l)
df['year'] = [next(c) for i in range(len(df))]

df['cat1'] = df['cat1'].interpolate(method='krogh') 
df['cat2'] = df['cat2'].fillna(method='ffill')
df

PS: I've left a question on post regarding how to handle cat2. Currently, I just assume it's ffill for the time being.

enter image description here

From the reading of the question, I assume that you don't need detection logic. So I won't provide. I just provide the conversion logic.

EBDS
  • 1,244
  • 5
  • 16
  • 1
    This doesn't answer OP's question fully. It is also not generalizable to other cases with different repeating sequences. – jabellcu Oct 11 '21 at 16:28
  • @jabellcu The question does not require solution for all columns, only for column year. cat1 is already handled, cat2, not necessary to provide, anyway, a fill forward will handle it. But there is not enough information to say ffill is correct, cause we don't have enough data in the question for cat2. As for detection logic (which column to apply which conversion, I don't think OP wanted that from reading the question). BTW, I used itertools cycle to enhance my code. I've already asked OP not to check this answer. I did it because it is a better solution, factorize + cycle. – EBDS Oct 12 '21 at 02:39
1

I would do the following:

from pandas.api.types import is_numeric_dtype
from itertools import cycle

def excel_drag(column):
    
    S = column[column.bfill().dropna().index].copy()  # drop last empty values
    numeric = is_numeric_dtype(S)
    groups = S.groupby(S, sort=False).apply(lambda df: len(df.index))
    
    if (len(groups) == len(S)):
        if numeric:
            # Extrapolate
            return column.interpolate(method='krogh')
        else:
            # ffill
            return column.ffill()
        
    elif (groups == groups.iloc[0]).all():  # All equal
        # Repeat sequence
        seq_len = len(groups)
        seq = cycle(S.iloc[:seq_len].values)
        filling = column[column.bfill().isna()].apply(lambda x: next(seq))
        return column.fillna(filling)
    
    else:
        # ffill
        return column.ffill()

With that function, df.apply(excel_drag, axis=0) results in:

     year  cat1 cat2
0  2019.0   1.0   c1
1  2020.0   2.0   c1
2  2019.0   3.0   c1
3  2020.0   4.0   c2
4  2019.0   5.0   c2
5  2020.0   6.0   c2
jabellcu
  • 692
  • 8
  • 20
0

Try using fillna(value) method where it replaces the Nan with the value passed into it.

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Oct 08 '21 at 13:29
  • Thanks for your anwer. What fillna method you want to suggest to fill the values in col['year'] – vishvas chauhan Oct 08 '21 at 13:33
  • Please add some details to your answer. –  Oct 08 '21 at 15:12