I have two pandas DataFrames, dftrain and dftest. Their labels need to be encoded identically because I want to use both for machine learning.
dftrain.label.unique()
array(['normal.', 'buffer_overflow.', 'loadmodule.', 'perl.', 'neptune.',
'smurf.', 'guess_passwd.', 'pod.', 'teardrop.', 'portsweep.',
'ipsweep.', 'land.', 'ftp_write.', 'back.', 'imap.', 'satan.',
'phf.', 'nmap.', 'multihop.', 'warezmaster.', 'warezclient.',
'spy.', 'rootkit.'], dtype=object)
dftest.label.unique()
array(['normal.', 'snmpgetattack.', 'named.', 'xlock.', 'smurf.',
'ipsweep.', 'multihop.', 'xsnoop.', 'sendmail.', 'guess_passwd.',
'saint.', 'buffer_overflow.', 'portsweep.', 'pod.', 'apache2.',
'phf.', 'udpstorm.', 'warezmaster.', 'perl.', 'satan.', 'xterm.',
'mscan.', 'processtable.', 'ps.', 'nmap.', 'rootkit.', 'neptune.',
'loadmodule.', 'imap.', 'back.', 'httptunnel.', 'worm.',
'mailbomb.', 'ftp_write.', 'teardrop.', 'land.', 'sqlattack.',
'snmpguess.'], dtype=object)
As you can see, the test set contains labels that are not present in the training set.
- How can I encode these labels so that, for example, the value 'normal.' maps to 1 in both DataFrames?
- What should I do with the test-set labels that are not present in the training set? If I have to remove those rows, how do I do it?
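
To make the two questions concrete, here is a minimal sketch of one possible approach, not a definitive answer. It uses small toy frames in place of the real dftrain and dftest, and assumes the labels live in a string column named label. The names label_to_code, label_code, and dftest_seen, and the placeholder code 0 for unseen labels, are illustrative choices of mine. Codes are assigned in order of first appearance in the training set, which puts 'normal.' at 1 here only because it appears first.

```python
import pandas as pd

# Toy stand-ins for the real frames (assumption: a string column named "label").
dftrain = pd.DataFrame({"label": ["normal.", "smurf.", "neptune.", "normal."]})
dftest = pd.DataFrame({"label": ["normal.", "snmpgetattack.", "smurf.", "worm."]})

# Build ONE mapping from the training labels and reuse it for both frames.
# Codes start at 1 and follow order of first appearance in dftrain.
train_labels = dftrain["label"].unique()
label_to_code = {label: code for code, label in enumerate(train_labels, start=1)}

dftrain["label_code"] = dftrain["label"].map(label_to_code)

# Option A: drop test rows whose label never occurs in the training set.
seen = dftest["label"].isin(train_labels)
dftest_seen = dftest[seen].copy()
dftest_seen["label_code"] = dftest_seen["label"].map(label_to_code)

# Option B: keep every test row and give unseen labels a placeholder code.
# Series.map leaves unmapped values as NaN, so fill them with 0 (hypothetical choice).
dftest["label_code"] = dftest["label"].map(label_to_code).fillna(0).astype(int)

print(label_to_code)
print(dftest_seen)
print(dftest)
```

Whether dropping the unseen rows (option A) or keeping them under a single "unknown" code (option B) is appropriate depends on what the model is supposed to do with attack types it was never trained on.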