How to manipulate repeated data values in mulitple columns with python

Question

I have around 50 columns and the rows of the columns are duplicated with similar values. For exmample as:

Idx    Series    Col1   Col2  Col3  Col4  Col5 .....   Col50 
 0      A                 1                   
 1      A                 1
 2      A                 1
 3      A                 1
 4      B        3             2                          3
 5      B        3             2                          3
 6      B        3             2                          3
 7      B        3             2                          3
 8      C                4                1
 9      C                4                1
10      C                4                1
11      C                4                1

I want to keep one value from the repeated values according to its position and change others to '0'

So if the value is '4' then the value at 4th position is kept and first three are changed to '0'. Similarly if the value is '1' then the first one is kept and remaining are changed to '0'.

The output columns will be:

Idx    Series    Col1   Col2  Col3  Col4  Col5 .....   Col50 
 0      A                 1                   
 1      A                 0
 2      A                 0
 3      A                 0
 4      B        0             0                          0
 5      B        0             2                          0
 6      B        3             0                          3
 7      B        0             0                          0
 8      C                0                1
 9      C                0                0
10      C                0                0
11      C                4                0

Any help will is appreciated.

Thank you..

score 2 · Answer 1 · answered Nov 16 '20 at 18:33

Since your DataFrame does not contain NaN values, I assume that:

Column Col1 thru Col50 are of object type,
They contain string data, either empty string or a string containing a single digit char.

To get your intended result, define 2 following functions:

A function to be applied to each column to be converted:

def cnv(col):
    return col.groupby((col != col.shift()).cumsum()).apply(cnv2)

A function to be applied to each group in the current column:

def cnv2(grp):
    if grp.str.isnumeric().all():
        ind = grp.index[0]
        val = int(grp.iloc[0])
        return grp.where(grp.index == val + ind - 1, '0')
    else:
        return grp

Then apply cnv function to the columns to be converted, saving the result back to the same columns:

df.loc[:, 'Col1':'Col50'] = df.loc[:, 'Col1':'Col50'].apply(cnv)

The result is:

    Idx Series Col1 Col2 Col3 Col5 Col50
0     0      A         1                
1     1      A         0                
2     2      A         0                
3     3      A         0                
4     4      B    0         0          0
5     5      B    0         2          0
6     6      B    3         0          3
7     7      B    0         0          0
8     8      C         0         1      
9     9      C         0         0      
10   10      C         0         0      
11   11      C         4         0

Thanks a lot for the help. Actually i am getting error with "if grp.str.isnumeric().all():" and error is: AttributeError: Can only use .str accessor with string values!, which clearly means that the values are not str or object type and the columns without values are actually empty columns.. — ToBeEXP, Nov 17 '20 at 13:04
I assumed that all *Col...* columns are of *object* type and contain **strings**. Apparently something in your DataFrame is arranged other way, but you didn't specify how. Generate your DataFrame as I expected (all strings, maybe empty) and then try my code. — Valdi_Bo, Nov 17 '20 at 14:57
https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code — ansev, Nov 18 '20 at 08:08
@Valdi_Bo: Thanks a lot for the help. After changing the datatypes, the code is working now. I appreciate your help.. — ToBeEXP, Nov 18 '20 at 15:48

ansev · Accepted Answer · 2020-11-16T18:15:20.363

1

Lets do

df2 = df.replace(r'^\s*$', np.nan, regex=True)
new_df = (df2.assign(**df2.filter(regex='Col')
                          .where(lambda x: x.eq(df.groupby('Series')
                                                .cumcount()
                                                .add(1), 
                                             axis=0) | x.isna() , 
                              0)
                  )
             .fillna('') # I recommend remove this line
         )
print(new_df)

Output

    Idx Series Col1 Col2 Col3
0     0      A         1     
1     1      A         0     
2     2      A         0     
3     3      A         0     
4     4      B    0         0
5     5      B    0         2
6     6      B    3         0
7     7      B    0         0
8     8      C         0     
9     9      C         0     
10   10      C         0     
11   11      C         4

edited Nov 16 '20 at 18:15

answered Nov 16 '20 at 18:04

ansev

30,322
5
17
31

@ ansev: Thanks for the answer. I am getting error with Series column in line ".where(lambda x: x.eq(df2.groupby('Series)" . I think the series column might be set as index and that might be the reason for error. I will add another index and try to see if it works. – ToBeEXP Nov 17 '20 at 13:11
Groupby.cumcount let the original Index because assign a int to each SeriesGroup value? What is the Code of error? What pandas versión are you using? Thanks in advance – ansev Nov 17 '20 at 13:23
Thanks a lot @ansev. I made the necessary modifications and its working now. Just a small question. If i want to replace all the values (0,1,2, 3) with anohter value lets say with 1, how can i do it according to your code and not separately by passing the range to pandas .replace function ? – ToBeEXP Nov 18 '20 at 14:19
You can add `.where(lambda x: x.notna(), 1)` after .replace method – ansev Nov 18 '20 at 14:52
After .replace method, it will replace all the values i mean the repeated values before applying the groupby method. I want to replace them after the repeated values are changed to -1 and only one value from (0,1,2,3) is left at its index position. Now i want to change it so this means i will add the code you sugested before .fllna() function right ? – ToBeEXP Nov 18 '20 at 15:52

How to manipulate repeated data values in mulitple columns with python

2 Answers2