0

I have around 50 columns and the rows of the columns are duplicated with similar values. For exmample as:

Idx    Series    Col1   Col2  Col3  Col4  Col5 .....   Col50 
 0      A                 1                   
 1      A                 1
 2      A                 1
 3      A                 1
 4      B        3             2                          3
 5      B        3             2                          3
 6      B        3             2                          3
 7      B        3             2                          3
 8      C                4                1
 9      C                4                1
10      C                4                1
11      C                4                1

I want to keep one value from the repeated values according to its position and change others to '0'

So if the value is '4' then the value at 4th position is kept and first three are changed to '0'. Similarly if the value is '1' then the first one is kept and remaining are changed to '0'.

The output columns will be:

Idx    Series    Col1   Col2  Col3  Col4  Col5 .....   Col50 
 0      A                 1                   
 1      A                 0
 2      A                 0
 3      A                 0
 4      B        0             0                          0
 5      B        0             2                          0
 6      B        3             0                          3
 7      B        0             0                          0
 8      C                0                1
 9      C                0                0
10      C                0                0
11      C                4                0

Any help will is appreciated.

Thank you..

ToBeEXP
  • 61
  • 1
  • 8

2 Answers2

2

Since your DataFrame does not contain NaN values, I assume that:

  • Column Col1 thru Col50 are of object type,
  • They contain string data, either empty string or a string containing a single digit char.

To get your intended result, define 2 following functions:

  1. A function to be applied to each column to be converted:

    def cnv(col):
        return col.groupby((col != col.shift()).cumsum()).apply(cnv2)
    
  2. A function to be applied to each group in the current column:

    def cnv2(grp):
        if grp.str.isnumeric().all():
            ind = grp.index[0]
            val = int(grp.iloc[0])
            return grp.where(grp.index == val + ind - 1, '0')
        else:
            return grp
    

Then apply cnv function to the columns to be converted, saving the result back to the same columns:

df.loc[:, 'Col1':'Col50'] = df.loc[:, 'Col1':'Col50'].apply(cnv)

The result is:

    Idx Series Col1 Col2 Col3 Col5 Col50
0     0      A         1                
1     1      A         0                
2     2      A         0                
3     3      A         0                
4     4      B    0         0          0
5     5      B    0         2          0
6     6      B    3         0          3
7     7      B    0         0          0
8     8      C         0         1      
9     9      C         0         0      
10   10      C         0         0      
11   11      C         4         0      
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
  • Thanks a lot for the help. Actually i am getting error with "if grp.str.isnumeric().all():" and error is: AttributeError: Can only use .str accessor with string values!, which clearly means that the values are not str or object type and the columns without values are actually empty columns.. – ToBeEXP Nov 17 '20 at 13:04
  • I assumed that all *Col...* columns are of *object* type and contain **strings**. Apparently something in your DataFrame is arranged other way, but you didn't specify how. Generate your DataFrame as I expected (all strings, maybe empty) and then try my code. – Valdi_Bo Nov 17 '20 at 14:57
  • https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code – ansev Nov 18 '20 at 08:08
  • @Valdi_Bo: Thanks a lot for the help. After changing the datatypes, the code is working now. I appreciate your help.. – ToBeEXP Nov 18 '20 at 15:48
1

Lets do

df2 = df.replace(r'^\s*$', np.nan, regex=True)
new_df = (df2.assign(**df2.filter(regex='Col')
                          .where(lambda x: x.eq(df.groupby('Series')
                                                .cumcount()
                                                .add(1), 
                                             axis=0) | x.isna() , 
                              0)
                  )
             .fillna('') # I recommend remove this line
         )
print(new_df)

Output

    Idx Series Col1 Col2 Col3
0     0      A         1     
1     1      A         0     
2     2      A         0     
3     3      A         0     
4     4      B    0         0
5     5      B    0         2
6     6      B    3         0
7     7      B    0         0
8     8      C         0     
9     9      C         0     
10   10      C         0     
11   11      C         4     
ansev
  • 30,322
  • 5
  • 17
  • 31
  • @ ansev: Thanks for the answer. I am getting error with Series column in line ".where(lambda x: x.eq(df2.groupby('Series)" . I think the series column might be set as index and that might be the reason for error. I will add another index and try to see if it works. – ToBeEXP Nov 17 '20 at 13:11
  • Groupby.cumcount let the original Index because assign a int to each SeriesGroup value? What is the Code of error? What pandas versión are you using? Thanks in advance – ansev Nov 17 '20 at 13:23
  • Thanks a lot @ansev. I made the necessary modifications and its working now. Just a small question. If i want to replace all the values (0,1,2, 3) with anohter value lets say with 1, how can i do it according to your code and not separately by passing the range to pandas .replace function ? – ToBeEXP Nov 18 '20 at 14:19
  • You can add `.where(lambda x: x.notna(), 1)` after .replace method – ansev Nov 18 '20 at 14:52
  • After .replace method, it will replace all the values i mean the repeated values before applying the groupby method. I want to replace them after the repeated values are changed to -1 and only one value from (0,1,2,3) is left at its index position. Now i want to change it so this means i will add the code you sugested before .fllna() function right ? – ToBeEXP Nov 18 '20 at 15:52