
I have two pandas DataFrames. I need them to share the same label encoding because I want to use them for machine learning.

dftrain.label.unique()

array(['normal.', 'buffer_overflow.', 'loadmodule.', 'perl.', 'neptune.',
       'smurf.', 'guess_passwd.', 'pod.', 'teardrop.', 'portsweep.',
       'ipsweep.', 'land.', 'ftp_write.', 'back.', 'imap.', 'satan.',
       'phf.', 'nmap.', 'multihop.', 'warezmaster.', 'warezclient.',
       'spy.', 'rootkit.'], dtype=object)

dftest.label.unique()

array(['normal.', 'snmpgetattack.', 'named.', 'xlock.', 'smurf.',
       'ipsweep.', 'multihop.', 'xsnoop.', 'sendmail.', 'guess_passwd.',
       'saint.', 'buffer_overflow.', 'portsweep.', 'pod.', 'apache2.',
       'phf.', 'udpstorm.', 'warezmaster.', 'perl.', 'satan.', 'xterm.',
       'mscan.', 'processtable.', 'ps.', 'nmap.', 'rootkit.', 'neptune.',
       'loadmodule.', 'imap.', 'back.', 'httptunnel.', 'worm.',
       'mailbomb.', 'ftp_write.', 'teardrop.', 'land.', 'sqlattack.',
       'snmpguess.'], dtype=object)

As you can see, there are labels in the test set that are not present in the train set.

  1. How can I encode these labels so that, for example, the value normal maps to 1 in both dataframes?
  2. What should I do with labels from the test set that are not present in the train set? If I have to remove them, how do I do it?
j doe
  • 1
    Combine both label lists, fit a LabelEncoder from sklearn on the combined list, and apply the fitted encoder to each list separately: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html – Mohamed Ali JAMAOUI Oct 04 '19 at 08:55
  • 1
    Possible duplicate of [Sklearn Label Encoding multiple columns pandas dataframe](https://stackoverflow.com/questions/44474570/sklearn-label-encoding-multiple-columns-pandas-dataframe) – Karan Sethi Oct 04 '19 at 08:56
  • @KaranSethi that's about multiple columns. I want to encode 2 dataframes, not columns. – j doe Oct 04 '19 at 09:00
  • @MohamedAliJAMAOUI good idea, can you give me the code for that? – j doe Oct 04 '19 at 09:01
  • 3
    @jdoe you have to try first, and people can help you when you are blocked. Note that in pandas you can concatenate two dataframes into a single one using pd.concat (check the examples in the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html), then fit your LabelEncoder on the result, then apply it to the two original dataframes separately. – Mohamed Ali JAMAOUI Oct 04 '19 at 09:23
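
The approach suggested in the comments above (concatenate the labels, fit one LabelEncoder, apply it to each dataframe) might look roughly like the sketch below. The two small dataframes are toy stand-ins for dftrain and dftest, and the names label_encoded, seen_in_train and dftest_seen are hypothetical, chosen only for illustration. The last step, dropping test rows whose label never occurs in the train set, is one possible way to handle question 2 and is not necessarily the right choice for every model.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for dftrain and dftest (hypothetical data, not the real sets).
dftrain = pd.DataFrame({"label": ["normal.", "smurf.", "neptune.", "normal."]})
dftest = pd.DataFrame({"label": ["normal.", "smurf.", "mscan.", "neptune."]})

# Fit a single encoder on the labels of both frames so the mapping is shared.
encoder = LabelEncoder()
encoder.fit(pd.concat([dftrain["label"], dftest["label"]], ignore_index=True))

# Apply the same fitted encoder to each frame separately.
dftrain["label_encoded"] = encoder.transform(dftrain["label"])
dftest["label_encoded"] = encoder.transform(dftest["label"])

# Optional: keep only test rows whose label also appears in the train set.
seen_in_train = dftest["label"].isin(dftrain["label"].unique())
dftest_seen = dftest[seen_in_train].copy()
```

LabelEncoder assigns integer codes starting at 0 in sorted order of the class names, so a given label such as normal. gets the same code in both dataframes, but not a code you pick yourself. If a specific value is required (for example normal. as 1), an explicit dictionary passed to Series.map would be an alternative.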

0 Answers