0

I have a dataframe (pandas):

import pandas as pd
df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'D': ['x', 'y', 'y', 'x'],})

I want to merge the values of all columns where the values in A are unique.

ouput = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                     'BC': ['bd', 'bd', 'ce', 'ce'],
                      'D': ['x', 'y', 'y', 'x'],})

It would be best to have a solution that works independently of column names B, C (perhaps there are also more columns with this "redundant information"). The column name of A is known.

Given the case that my initial dataframe is:

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'd', 'e'],
                   'D': ['x', 'y', 'y', 'x'],})

the desired output is the initial df (no change):

df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'd', 'e'],
                   'D': ['x', 'y', 'y', 'x'],})

Many thanks!

Full solution (thanks to Robby the Belgian):

import pandas as pd
df = pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'],
                   'B': ['b', 'b', 'c', 'c'],
                   'C': ['d', 'd', 'e', 'e'],
                   'D': ['x', 'y', 'y', 'x']})

print(df)

def is_redundant(df, A, B):
    #remove column a
    A=A
    B=B
    if len(df.groupby(f'{A}')) == len(df.groupby([f'{A}', f'{B}'])):

        return True
    else:
        return False

def drop_redundant(df, redundant_groups):
    list=redundant_groups
    for i in list:
        if len(df.groupby(f'{i[0]}')) == len(df.groupby([f'{i[0]}', f'{i[1]}'])):
            df[f'{i[0]}' + f'{i[1]}'] = df[[f'{i[0]}', f'{i[1]}']].sum(axis=1)
            df.drop([f'{i[0]}', f'{i[1]}'], axis=1, inplace=True)
            return(df)
        else:
            return(df)

cols = [c for c in df.columns if c != 'A']
redundant_groups = []
idx_left = 0
while idx_left < len(cols)-1:
    new_group = []
    idx_right = idx_left+1
    while idx_right < len(cols):
        if is_redundant(df, cols[idx_left], cols[idx_right]):
            new_group.append(cols.pop(idx_right))
        else:
            idx_right += 1
    if new_group:
        redundant_groups.append(new_group + [cols[idx_left]])
    idx_left += 1

print(redundant_groups)

drop_redundant(df, redundant_groups)

print(df)

Output:

  A  B  C  D
0  x1  b  d  x
1  x2  b  d  y
2  x3  c  e  y
3  x4  c  e  x
[['C', 'B']]
    A  D  CB
0  x1  x  db
1  x2  y  db
2  x3  y  ec
3  x4  x  ec
[Finished in 0.837s]
solaris
  • 5
  • 2
  • 1
    where is your dataframe ? I saw list only – BENY Sep 02 '20 at 15:13
  • 1
    Also those lists are invalid, please spend some time and create a [mcve](https://stackoverflow.com/help/minimal-reproducible-example) , you can take help from this link [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and edit your question – anky Sep 02 '20 at 15:15
  • what is the output if 3rd row is `x3 c d y`? – deadshot Sep 02 '20 at 15:29
  • @deadshot in that case, no change is desired, I updated my initial post. Thank you. – solaris Sep 02 '20 at 15:36
  • @solaris i mean 3rd row in dataframe not 3rd key in the dictionary `pd.DataFrame({'A': ['x1', 'x2', 'x3', 'x4'], 'B': ['b', 'b', 'c', 'c'], 'C': ['d', 'd', 'd', 'e'],'D': ['x', 'y', 'y', 'x'],})` – deadshot Sep 02 '20 at 15:38
  • @deadshot Sorry, misunderstood you. Updated again! – solaris Sep 02 '20 at 15:42
  • do you want to combine columns if all the rows of that columns are equal? – deadshot Sep 02 '20 at 15:45
  • @deadshot Equal in terms of their values relative to the values in column A. – solaris Sep 02 '20 at 15:48
  • how `x1 b d x` relatively equal to `x2 b d y` can you explain how you are selecting the columns to combine with an example – deadshot Sep 02 '20 at 15:50
  • Sorry, confusion over columns and rows again. B and C are somehow redundant as their values point to the same values in column A. – solaris Sep 02 '20 at 15:52
  • @BEN_YO thank you, created proper pandas! – solaris Sep 02 '20 at 16:21

1 Answers1

0

To compare whether columns 'B' and 'C' are "redundant":

len(df.groupby('B')) == len(df.groupby(['B', 'C'])

This checks whether adding 'C' to the grouping labels requires us to add more groups, compared to only grouping by 'B'.

You can then easily run this on all pairs of labels in df.columns (making sure to not include 'A').

If you find that two columns have redundant information, you can use:

df['B' + 'C'] = df[['B', 'C']].sum(axis=1)
df.drop(['B', 'C'], axis=1, inplace=True)

to replace them with the combined information.

If you want to use this in a double loop (checking all pairs of columns), you'll have to be careful, since you might have 3 columns that all contain the same information (say, B, C, and F), and after dealing with B and C you would try to compare B and F -- but column B does no longer exist.

To deal with this, I might try first constructing a list of all pairs that are redundant. Let's assume we have a "is_redundant(df, c1, c2)" function (which uses the above line to compare).

cols = [c for c in df.columns if c != 'A']
redundant_groups = []
idx_left = 0
while idx_left < len(cols)-1:
    new_group = []
    idx_right = idx_left+1
    while idx_right < len(cols):    
        if is_redundant(df, cols[idx_left], cols[idx_right]):
            new_group.append(cols.pop(idx_right))
        else:
            idx_right += 1
    if new_group:
        redundant_groups.append(new_group + [cols[idx_left]])
    idx_left += 1

This creates groups of columns that are all mutually redundant.

After that, you can easily modify the above combination code to deal with multiple columns at once.