2

I would like to find a solution for the following problem:

import pandas as pd

rows = {'Id': ['xb01','nt02','tw02','dt92','tw03','we04','er04','ew06','re07','ti92'],
    'DatasetName': ['first label','second label','third     label','fourth label','third 
label','third label','third label','fourth label','first  label','last label'],
    'Target': ['first label','second label','the    third labels','fourth label 
set','third    label', 'third label','third label  sets','fourth label    sets','first 
label','last labels']
    }

df = pd.DataFrame(rows, columns = ['Id', 'DatasetName','Target'])

print (df)

The dataframe looks like this:

     Id      DatasetName                      Target

   xb01         first label              first label
   nt02        second label             second label
   tw02     third     label      the    third labels
   dt92        fourth label         fourth label set
   tw03         third label           third    label
   we04         third label              third label
   er04         third label        third label  sets
   ew06        fourth label     fourth label    sets
   re07        first  label              first label
   ti92          last label              last labels

pseudo code:

   for i in len(range(df)):
      if DatasetName[i].is_unique:
         if DatasetName[i]!=Target[i]:
            Target[i]=DatasetName[i]+ '|'+Target[i]
      else:
         loop through dataframe and find all labels that belongs to the same DatasetName 
         and append all those Target names together. (Note: if DatasetName is not same as 
         Target Name(s), the Dataset name should also append to the Target)

Here we can see:

   DatasetName    Appeared   Target

   first label    2          first label
   second label   1          second label
   third label    4          the third labels | third label | third label sets
   fourth label   2          fourth label set | fourth label sets|fourth label
   last label     1          last labels | last label

The expect output:

   Id                  DatasetName                                             Target
  
 xb01                  first label                                        first label
 nt02                 second label                                      second  label
 tw02                  third label      the third labels|third label|third label sets
 dt92                 fourth label   fourth label set|fourth label sets |fourth label
 tw03                  third label      the third labels|third label|third label sets
 we04                  third label      the third labels|third label|third label sets
 er04                  third label      the third labels|third label|third label sets
 ew06                 fourth label   fourth label set|fourth label sets| fourth label
 re07                  first label                                        first label
 ti92                   last label                             last labels|last label
            

Note: The real dataframe have 100,000 rows. There might be extra spaces still exist in those strings(I have already implemented dataframe-lower case(), removed all extra marks,etc.). It might have some mistakes (typo) in this question(I have copied and pasted several times), but hopefully you get my idea for what 's the solution I 'm looking for. Thank you!

almo
  • 561
  • 2
  • 16
  • 1
    Hi, I understand what you are looking for, had the same problem in the past, and solved using networkx https://networkx.org/. – manuzambo May 04 '21 at 09:42
  • Thanks @manuzambo, I will try it out. But if you would to give a demo, I 'll appreciate it :) – almo May 04 '21 at 09:54
  • Check out this question: https://stackoverflow.com/questions/57228605/creating-dictionary-of-parent-child-pairs-in-pandas-dataframe there's a demo in the answers. – manuzambo May 04 '21 at 10:01
  • Thanks @manuzambo, I guess I need to take a deeper look. Thank you:) – almo May 04 '21 at 10:07

1 Answers1

1

Let's try an agg with unique values and merge back:

import pandas as pd

rows = {'Id': ['xb01', 'nt02', 'tw02', 'dt92', 'tw03', 'we04',
               'er04', 'ew06', 're07', 'ti92'],
        'DatasetName': ['first label', 'second label', 'third     label',
                        'fourth label', 'third label', 'third label',
                        'third label', 'fourth label',
                        'first  label', 'last label'],
        'Target': ['first label', 'second label', 'the    third labels',
                   'fourth label set', 'third label',
                   'third label', 'third label  sets',
                   'fourth label    sets', 'first label', 'last labels']
        }

df = pd.DataFrame(rows, columns=['Id', 'DatasetName', 'Target'])
# Fix Spacing In Columns names
df = df.replace({r'\s+': ' '}, regex=True)
# Get Unique Matches
matches = df.groupby('DatasetName') \
    .apply(lambda x: x['DatasetName'].append(x['Target']).unique()) \
    .agg('|'.join).rename('Target')
# Merge back to original DataFrame
merged = df.drop(columns=['Target']).merge(matches, on='DatasetName', how="left")

# For Display
print(merged.to_string())

Output:

     Id   DatasetName                                           Target
0  xb01   first label                                      first label
1  nt02  second label                                     second label
2  tw02   third label    third label|the third labels|third label sets
3  dt92  fourth label  fourth label|fourth label set|fourth label sets
4  tw03   third label    third label|the third labels|third label sets
5  we04   third label    third label|the third labels|third label sets
6  er04   third label    third label|the third labels|third label sets
7  ew06  fourth label  fourth label|fourth label set|fourth label sets
8  re07   first label                                      first label
9  ti92    last label                           last label|last labels
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57