I would like to find a solution for the following problem:
import pandas as pd
rows = {'Id': ['xb01','nt02','tw02','dt92','tw03','we04','er04','ew06','re07','ti92'],
'DatasetName': ['first label','second label','third label','fourth label','third
label','third label','third label','fourth label','first label','last label'],
'Target': ['first label','second label','the third labels','fourth label
set','third label', 'third label','third label sets','fourth label sets','first
label','last labels']
}
df = pd.DataFrame(rows, columns = ['Id', 'DatasetName','Target'])
print (df)
The dataframe looks like this:
Id DatasetName Target
xb01 first label first label
nt02 second label second label
tw02 third label the third labels
dt92 fourth label fourth label set
tw03 third label third label
we04 third label third label
er04 third label third label sets
ew06 fourth label fourth label sets
re07 first label first label
ti92 last label last labels
pseudo code:
for i in len(range(df)):
if DatasetName[i].is_unique:
if DatasetName[i]!=Target[i]:
Target[i]=DatasetName[i]+ '|'+Target[i]
else:
loop through dataframe and find all labels that belongs to the same DatasetName
and append all those Target names together. (Note: if DatasetName is not same as
Target Name(s), the Dataset name should also append to the Target)
Here we can see:
DatasetName Appeared Target
first label 2 first label
second label 1 second label
third label 4 the third labels | third label | third label sets
fourth label 2 fourth label set | fourth label sets|fourth label
last label 1 last labels | last label
The expect output:
Id DatasetName Target
xb01 first label first label
nt02 second label second label
tw02 third label the third labels|third label|third label sets
dt92 fourth label fourth label set|fourth label sets |fourth label
tw03 third label the third labels|third label|third label sets
we04 third label the third labels|third label|third label sets
er04 third label the third labels|third label|third label sets
ew06 fourth label fourth label set|fourth label sets| fourth label
re07 first label first label
ti92 last label last labels|last label
Note: The real dataframe have 100,000 rows. There might be extra spaces still exist in those strings(I have already implemented dataframe-lower case(), removed all extra marks,etc.). It might have some mistakes (typo) in this question(I have copied and pasted several times), but hopefully you get my idea for what 's the solution I 'm looking for. Thank you!