Labeling of ISCX Botnet Dataset 2014 in Python

Question

I am having trouble using the malicious IP information published on the website for the CIC Botnet Dataset. The website describes the dataset as follows:

Distribution of botnet types in the training dataset

Botnet name | Type | Portion of flows in dataset

Neris | IRC | 21159 (12%)
Rbot | IRC | 39316 (22%)
Virut | HTTP | 1638 (0.94%)
NSIS | P2P | 4336 (2.48%)
SMTP Spam | P2P | 11296 (6.48%)
Zeus | P2P | 31 (0.01%)
Zeus control (C & C) | P2P | 20 (0.01%)

The resulting set was divided into training and test datasets that included 7 and 16 types of botnets, respectively. Tables 1 and 2 detail the distribution and type of botnets in each dataset. Our training dataset is 5.3 GB in size, of which 43.92% is malicious and the remainder contains normal flows. The test dataset is 8.5 GB, of which 44.97% is malicious flows. We added more diversity of botnet traces in the test dataset than in the training dataset in order to evaluate the novelty detection a feature subset can provide.

Distribution of botnet types in the test dataset

Botnet name | Type | Portion of flows in dataset

Neris | IRC | 25967 (5.67%)
Rbot | IRC | 83 (0.018%)
Menti | IRC | 2878 (0.62%)
Sogou | HTTP | 89 (0.019%)
Murlo | IRC | 4881 (1.06%)
Virut | HTTP | 58576 (12.80%)
NSIS | P2P | 757 (0.165%)
Zeus | P2P | 502 (0.109%)
SMTP Spam | P2P | 21633 (4.72%)
UDP Storm | P2P | 44062 (9.63%)
Tbot | IRC | 1296 (0.283%)
Zero Access | P2P | 1011 (0.221%)
Weasel | P2P | 42313 (9.25%)
Smoke Bot | P2P | 78 (0.017%)
Zeus Control (C&C) | P2P | 31 (0.006%)
ISCX IRC bot | P2P | 1816 (0.387%)
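As a rough sanity check on the two tables above, the per-botnet percentages can be summed and compared with the malicious share quoted in the text (43.92% for training, 44.97% for testing). The sketch below is only illustrative: the names `TRAIN`, `TEST`, and `summarize` are made up here, and the numbers are copied from the tables. The "implied total" is just the summed count divided by the summed percentage, so it is an estimate and not a figure published with the dataset.

```python
# (flow count, percentage) pairs copied from the two tables above.
TRAIN = {
    "Neris": (21159, 12.0),
    "Rbot": (39316, 22.0),
    "Virut": (1638, 0.94),
    "NSIS": (4336, 2.48),
    "SMTP Spam": (11296, 6.48),
    "Zeus": (31, 0.01),
    "Zeus control (C & C)": (20, 0.01),
}
TEST = {
    "Neris": (25967, 5.67),
    "Rbot": (83, 0.018),
    "Menti": (2878, 0.62),
    "Sogou": (89, 0.019),
    "Murlo": (4881, 1.06),
    "Virut": (58576, 12.80),
    "NSIS": (757, 0.165),
    "Zeus": (502, 0.109),
    "SMTP Spam": (21633, 4.72),
    "UDP Storm": (44062, 9.63),
    "Tbot": (1296, 0.283),
    "Zero Access": (1011, 0.221),
    "Weasel": (42313, 9.25),
    "Smoke Bot": (78, 0.017),
    "Zeus Control (C&C)": (31, 0.006),
    "ISCX IRC bot": (1816, 0.387),
}


def summarize(table):
    """Return (malicious flows, summed percentage, implied total flows)."""
    flows = sum(count for count, _ in table.values())
    percent = sum(pct for _, pct in table.values())
    # Estimate of the dataset size implied by the published figures.
    implied_total = flows / (percent / 100.0)
    return flows, percent, implied_total


for name, table in (("training", TRAIN), ("test", TEST)):
    flows, percent, implied_total = summarize(table)
    print(f"{name}: {flows} malicious flows, {percent:.3f}% "
          f"of roughly {implied_total:.0f} flows")
```

The summed percentages come out close to the malicious shares quoted in the text, which suggests the tables are at least internally consistent. Flow counts produced by a different flow extractor or with different timeout settings may still differ from the published ones.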

List of malicious IPs

IRC
    192.168.2.112 -> 131.202.243.84
    192.168.5.122 -> 198.164.30.2
    192.168.2.110 -> 192.168.5.122
    192.168.4.118 -> 192.168.5.122
    192.168.2.113 -> 192.168.5.122
    192.168.1.103 -> 192.168.5.122
    192.168.4.120 -> 192.168.5.122
    192.168.2.112 -> 192.168.2.110
    192.168.2.112 -> 192.168.4.120
    192.168.2.112 -> 192.168.1.103
    192.168.2.112 -> 192.168.2.113
    192.168.2.112 -> 192.168.4.118
    192.168.2.112 -> 192.168.2.109
    192.168.2.112 -> 192.168.2.105
    192.168.1.105 -> 192.168.5.122
Neris: 147.32.84.180
RBot: 147.32.84.170
Menti: 147.32.84.150
Sogou: 147.32.84.140
Murlo: 147.32.84.130
Virut: 147.32.84.160
IRCbot and black hole1: 10.0.2.15
Black hole 2: 192.168.106.141
Black hole 3: 192.168.106.131
TBot: 172.16.253.130, 172.16.253.131, 172.16.253.129, 172.16.253.240
Weasel: Botmaster IP: 74.78.117.238; Bot IP: 158.65.110.24
Zeus (zeus sample 1 and 2 and 3, bin_zeus): 192.168.3.35, 192.168.3.25, 192.168.3.65, 172.29.0.116
Osx_trojan: 172.29.0.109
Zero access (zero access 1 and 2): 172.16.253.132, 192.168.248.165
Smoke bot: 10.37.130.4
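One possible way to turn the list above into labels is to compare the source and destination addresses of each flow exactly, instead of searching for substrings. This is a minimal sketch under several assumptions: the flow table has `Src IP` and `Dst IP` columns (they appear in the code further down), the names `MALICIOUS_HOSTS`, `MALICIOUS_PAIRS`, and `label_flows` are hypothetical, and whether the IRC pairs should match in one direction only is a choice the website does not spell out, so it is exposed as the `directed` parameter.

```python
import pandas as pd

# Single hosts copied from the list above (family name kept for reference).
MALICIOUS_HOSTS = {
    "147.32.84.180": "Neris", "147.32.84.170": "RBot",
    "147.32.84.150": "Menti", "147.32.84.140": "Sogou",
    "147.32.84.130": "Murlo", "147.32.84.160": "Virut",
    "10.0.2.15": "IRCbot and black hole 1",
    "192.168.106.141": "Black hole 2", "192.168.106.131": "Black hole 3",
    "172.16.253.130": "TBot", "172.16.253.131": "TBot",
    "172.16.253.129": "TBot", "172.16.253.240": "TBot",
    "74.78.117.238": "Weasel botmaster", "158.65.110.24": "Weasel bot",
    "192.168.3.35": "Zeus", "192.168.3.25": "Zeus",
    "192.168.3.65": "Zeus", "172.29.0.116": "Zeus",
    "172.29.0.109": "Osx_trojan",
    "172.16.253.132": "Zero access", "192.168.248.165": "Zero access",
    "10.37.130.4": "Smoke bot",
}

# (source, destination) pairs copied from the IRC block above.
MALICIOUS_PAIRS = {
    ("192.168.2.112", "131.202.243.84"), ("192.168.5.122", "198.164.30.2"),
    ("192.168.2.110", "192.168.5.122"), ("192.168.4.118", "192.168.5.122"),
    ("192.168.2.113", "192.168.5.122"), ("192.168.1.103", "192.168.5.122"),
    ("192.168.4.120", "192.168.5.122"), ("192.168.2.112", "192.168.2.110"),
    ("192.168.2.112", "192.168.4.120"), ("192.168.2.112", "192.168.1.103"),
    ("192.168.2.112", "192.168.2.113"), ("192.168.2.112", "192.168.4.118"),
    ("192.168.2.112", "192.168.2.109"), ("192.168.2.112", "192.168.2.105"),
    ("192.168.1.105", "192.168.5.122"),
}


def label_flows(flows: pd.DataFrame, directed: bool = True) -> pd.Series:
    """Return 1 for flows touching a listed host or pair, else 0."""
    src = flows["Src IP"].astype(str)
    dst = flows["Dst IP"].astype(str)
    hosts = list(MALICIOUS_HOSTS)
    # A listed host on either side of the flow counts as a hit.
    host_hit = src.isin(hosts) | dst.isin(hosts)
    # Pairs match as listed; with directed=False the reverse order matches too.
    pair_hit = pd.Series(
        [(s, d) in MALICIOUS_PAIRS
         or (not directed and (d, s) in MALICIOUS_PAIRS)
         for s, d in zip(src, dst)],
        index=flows.index, dtype=bool,
    )
    return (host_hit | pair_hit).astype(int)


# Made-up flows, used only to show the call.
demo = pd.DataFrame({
    "Src IP": ["147.32.84.180", "192.168.2.112", "192.168.5.122", "8.8.8.8"],
    "Dst IP": ["8.8.8.8", "131.202.243.84", "192.168.2.110", "8.8.4.4"],
})
demo["Label"] = label_flows(demo)
print(demo)
```

With this approach the labels have to be computed before the `Src IP` and `Dst IP` columns are dropped from the feature table.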

However, after labeling the flows with the Python code below, which I wrote myself, I suspect that either my code is labeling incorrectly or the information on their website is erroneous.

I would be very grateful if someone could help me with this.

Below is the labeling code in Python:

import gc
import importlib
import pickle
import re
import timeit

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import xgboost as xgb
from scipy import stats
from sklearn.preprocessing import MinMaxScaler

# from tensorflow.keras import backend
from tensorflow.python.keras import backend

# Load custom functions and reload the module so that local changes are picked up
import gan
importlib.reload(gan)
from gan import *

%matplotlib inline
plt.style.use('ggplot')

gc.collect()

%cd $DATA_SET_PATH
!ls

begin_from_start = 0

take_chunk = 0 


if begin_from_start:

    data = pd.read_csv (r'ISCX_Botnet-Training.pcap_Flow.csv', low_memory=False)
    # data = data[0:50000] 



    print(data.shape)

if begin_from_start:
    testing_data = pd.read_csv (r'ISCX_Botnet-Testing.pcap_Flow.csv', low_memory=False)
    # testing_data = testing_data[0:50000]

    print(testing_data.shape)

if begin_from_start:

    botnet = True
    z_score = False

if begin_from_start:

    #replace inf with nan and then drop the rows with nans
    print("Null Values in data set: " + str(data.isnull().sum().sum()) )

    print(data.shape)

    data = data.replace([np.inf, -np.inf], np.nan).dropna(how="any").reset_index(drop=True)

    print("Null Values in data set: " + str(data.isnull().sum().sum()) )


    print(data.shape)

if begin_from_start:

    #replace inf with nan and then drop the rows with nans
    print("Null Values in data set: " + str(testing_data.isnull().sum().sum()) )

    print(testing_data.shape)

    testing_data = testing_data.replace([np.inf, -np.inf], np.nan).dropna(how="any").reset_index(drop=True)

    print("Null Values in data set: " + str(testing_data.isnull().sum().sum()) )


    print(testing_data.shape)




if begin_from_start:

    # The data columns are all columns except Src IP, Src Port, Dst IP, Dst Port,
    # Timestamp and Protocol, as we are not considering categorical and timestamp features.

    # data= data.drop(['Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Timestamp', 'Protocol', \
    #                  'FIN Flag Cnt', 'SYN Flag Cnt', 'RST Flag Cnt', 'PSH Flag Cnt', 'ACK Flag Cnt', \
    #                  'CWE Flag Count', 'ECE Flag Cnt'], axis=1)
    if botnet == True: 
    #     data= data.drop(['Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Timestamp', 'Protocol', 'Init Fwd Win Byts'], axis=1)
        data= data.drop(['Src IP', 'Src Port', 'Dst IP', 'Dst Port', 'Timestamp', 'Protocol'], axis=1)


        print(data.shape)

        #In this cell we will find the indices in flows that will show the flows for the particular botnet


        df = data[['Flow ID', 'Label']]

        # regex=False: match the address literally, otherwise every '.' is a regex wildcard.
        # Note that this is still a substring match on the Flow ID.
        IRC_1 = df['Flow ID'].str.contains('192.168.2.112-131.202.243.84', regex=False)
        IRC_2 = df['Flow ID'].str.contains('192.168.5.122-198.164.30.2', regex=False)
        IRC_3 = df['Flow ID'].str.contains('192.168.2.110-192.168.5.122', regex=False)
        IRC_4 = df['Flow ID'].str.contains('192.168.4.118-192.168.5.122', regex=False)
        IRC_5 = df['Flow ID'].str.contains('192.168.2.113-192.168.5.122', regex=False)
        IRC_6 = df['Flow ID'].str.contains('192.168.1.103-192.168.5.122', regex=False)
        IRC_7 = df['Flow ID'].str.contains('192.168.4.120-192.168.5.122', regex=False)
        IRC_8 = df['Flow ID'].str.contains('192.168.2.112-192.168.2.110', regex=False)
        IRC_9 = df['Flow ID'].str.contains('192.168.2.112-192.168.4.120', regex=False)
        IRC_10 = df['Flow ID'].str.contains('192.168.2.112-192.168.1.103', regex=False)
        IRC_11 = df['Flow ID'].str.contains('192.168.2.112-192.168.2.113', regex=False)
        IRC_12 = df['Flow ID'].str.contains('192.168.2.112-192.168.4.118', regex=False)
        IRC_13 = df['Flow ID'].str.contains('192.168.2.112-192.168.2.109', regex=False)
        IRC_14 = df['Flow ID'].str.contains('192.168.2.112-192.168.2.105', regex=False)
        IRC_15 = df['Flow ID'].str.contains('192.168.1.105-192.168.5.122', regex=False)


        Neris = df['Flow ID'].str.contains('147.32.84.180', regex=False)
        RBot  = df['Flow ID'].str.contains('147.32.84.170', regex=False)
        Menti = df['Flow ID'].str.contains('147.32.84.150', regex=False)
        Sogou = df['Flow ID'].str.contains('147.32.84.140', regex=False)
        Murlo = df['Flow ID'].str.contains('147.32.84.130', regex=False)
        Virut = df['Flow ID'].str.contains('147.32.84.160', regex=False)
        IRCbot_and_black_hole_1 = df['Flow ID'].str.contains('10.0.2.15', regex=False)
        Black_hole_2 = df['Flow ID'].str.contains('192.168.106.141', regex=False)
        Black_hole_3 = df['Flow ID'].str.contains('192.168.106.131', regex=False)
        TBot_1 = df['Flow ID'].str.contains('172.16.253.130', regex=False)
        TBot_2 = df['Flow ID'].str.contains('172.16.253.131', regex=False)
        TBot_3 = df['Flow ID'].str.contains('172.16.253.129', regex=False)
        TBot_4 = df['Flow ID'].str.contains('172.16.253.240', regex=False)
        Weasel_master = df['Flow ID'].str.contains('74.78.117.238', regex=False)
        Weasel_bot = df['Flow ID'].str.contains('158.65.110.24', regex=False)
        Zeus_1  = df['Flow ID'].str.contains('192.168.3.35', regex=False)
        Zeus_2 = df['Flow ID'].str.contains('192.168.3.25', regex=False)
        Zeus_3 = df['Flow ID'].str.contains('192.168.3.65', regex=False)
        bin_Zeus = df['Flow ID'].str.contains('172.29.0.116', regex=False)
        Osx_trojan = df['Flow ID'].str.contains('172.29.0.109', regex=False)
        zero_access_1 = df['Flow ID'].str.contains('172.16.253.132', regex=False)
        zero_access_2 = df['Flow ID'].str.contains('192.168.248.165', regex=False)
        Smoke_bot = df['Flow ID'].str.contains('10.37.130.4', regex=False)





        indx_IRC_1 = [i for i, x in enumerate(IRC_1) if x]
        indx_IRC_2 = [i for i, x in enumerate(IRC_2) if x]
        indx_IRC_3 = [i for i, x in enumerate(IRC_3) if x]
        indx_IRC_4 = [i for i, x in enumerate(IRC_4) if x]
        indx_IRC_5 = [i for i, x in enumerate(IRC_5) if x]
        indx_IRC_6 = [i for i, x in enumerate(IRC_6) if x]
        indx_IRC_7 = [i for i, x in enumerate(IRC_7) if x]
        indx_IRC_8 = [i for i, x in enumerate(IRC_8) if x]
        indx_IRC_9 = [i for i, x in enumerate(IRC_9) if x]
        indx_IRC_10 = [i for i, x in enumerate(IRC_10) if x]
        indx_IRC_11 = [i for i, x in enumerate(IRC_11) if x]
        indx_IRC_12 = [i for i, x in enumerate(IRC_12) if x]
        indx_IRC_13 = [i for i, x in enumerate(IRC_13) if x]
        indx_IRC_14 = [i for i, x in enumerate(IRC_14) if x]
        indx_IRC_15 = [i for i, x in enumerate(IRC_15) if x]


        indx_Neris = [i for i, x in enumerate(Neris) if x]
        indx_RBot  = [i for i, x in enumerate(RBot) if x]
        indx_Menti = [i for i, x in enumerate(Menti) if x]
        indx_Sogou = [i for i, x in enumerate(Sogou) if x]
        indx_Murlo = [i for i, x in enumerate(Murlo) if x]
        indx_Virut = [i for i, x in enumerate(Virut) if x]
        indx_IRCbot_and_black_hole_1 = [i for i, x in enumerate(IRCbot_and_black_hole_1) if x]
        indx_Black_hole_2 = [i for i, x in enumerate(Black_hole_2) if x]
        indx_Black_hole_3 = [i for i, x in enumerate(Black_hole_3) if x]
        indx_TBot_1 = [i for i, x in enumerate(TBot_1) if x]
        indx_TBot_2 = [i for i, x in enumerate(TBot_2) if x]
        indx_TBot_3 = [i for i, x in enumerate(TBot_3) if x]
        indx_TBot_4 = [i for i, x in enumerate(TBot_4) if x]
        indx_Weasel_master = [i for i, x in enumerate(Weasel_master) if x]
        indx_Weasel_bot = [i for i, x in enumerate(Weasel_bot) if x]
        indx_Zeus_1  = [i for i, x in enumerate(Zeus_1) if x]
        indx_Zeus_2 = [i for i, x in enumerate(Zeus_2) if x]
        indx_Zeus_3 = [i for i, x in enumerate(Zeus_3) if x]
        indx_bin_Zeus = [i for i, x in enumerate(bin_Zeus) if x]
        indx_Osx_trojan = [i for i, x in enumerate(Osx_trojan) if x]
        indx_zero_access_1 = [i for i, x in enumerate(zero_access_1) if x]
        indx_zero_access_2 = [i for i, x in enumerate(zero_access_2) if x]
        indx_Smoke_bot = [i for i, x in enumerate(Smoke_bot) if x]


        total_instances = df.shape[0]
        print("Total Instances:" + str(total_instances))

        print("bin_IRC_1_Instances:" + str(len(indx_IRC_1))+ " ---> "+ str(round(len(indx_IRC_1)/total_instances*100, 4)) + " %")
        print("bin_IRC_2_Instances:" + str(len(indx_IRC_2))+ " ---> "+ str(round(len(indx_IRC_2)/total_instances*100, 4)) + " %")
        print("bin_IRC_3_Instances:" + str(len(indx_IRC_3))+ " ---> "+ str(round(len(indx_IRC_3)/total_instances*100, 4)) + " %")
        print("bin_IRC_4_Instances:" + str(len(indx_IRC_4))+ " ---> "+ str(round(len(indx_IRC_4)/total_instances*100, 4)) + " %")
        print("bin_IRC_5_Instances:" + str(len(indx_IRC_5))+ " ---> "+ str(round(len(indx_IRC_5)/total_instances*100, 4)) + " %")
        print("bin_IRC_6_Instances:" + str(len(indx_IRC_6))+ " ---> "+ str(round(len(indx_IRC_6)/total_instances*100, 4)) + " %")
        print("bin_IRC_7_Instances:" + str(len(indx_IRC_7))+ " ---> "+ str(round(len(indx_IRC_7)/total_instances*100, 4)) + " %")
        print("bin_IRC_8_Instances:" + str(len(indx_IRC_8))+ " ---> "+ str(round(len(indx_IRC_8)/total_instances*100, 4)) + " %")
        print("bin_IRC_9_Instances:" + str(len(indx_IRC_9))+ " ---> "+ str(round(len(indx_IRC_9)/total_instances*100, 4)) + " %")
        print("bin_IRC_10_Instances:" + str(len(indx_IRC_10))+ " ---> "+ str(round(len(indx_IRC_10)/total_instances*100, 4)) + " %")
        print("bin_IRC_11_Instances:" + str(len(indx_IRC_11))+ " ---> "+ str(round(len(indx_IRC_11)/total_instances*100, 4)) + " %")
        print("bin_IRC_12_Instances:" + str(len(indx_IRC_12))+ " ---> "+ str(round(len(indx_IRC_12)/total_instances*100, 4)) + " %")
        print("bin_IRC_13_Instances:" + str(len(indx_IRC_13))+ " ---> "+ str(round(len(indx_IRC_13)/total_instances*100, 4)) + " %")
        print("bin_IRC_14_Instances:" + str(len(indx_IRC_14))+ " ---> "+ str(round(len(indx_IRC_14)/total_instances*100, 4)) + " %")
        print("bin_IRC_15_Instances:" + str(len(indx_IRC_15))+ " ---> "+ str(round(len(indx_IRC_15)/total_instances*100, 4)) + " %")



        print("Neris_Instances:" + str(len(indx_Neris)) + " ---> "+ str(round(len(indx_Neris)/total_instances*100, 4)) + " %")
        print("RBot_Instances:" + str(len(indx_RBot)) + " ---> "+ str(round(len(indx_RBot)/total_instances*100, 4)) + " %")
        print("Menti_Instances:" + str(len(indx_Menti)) + " ---> "+ str(round(len(indx_Menti)/total_instances*100, 4)) + " %")
        print("Sogou_Instances:" + str(len(indx_Sogou)) + " ---> "+ str(round(len(indx_Sogou)/total_instances*100, 4)) + " %")
        print("Murlo_Instances:" + str(len(indx_Murlo)) + " ---> "+ str(round(len(indx_Murlo)/total_instances*100, 4)) + " %")
        print("Virut_Instances:" + str(len(indx_Virut)) + " ---> "+ str(round(len(indx_Virut)/total_instances*100, 4)) + " %")
        print("IRCbot_and_black_hole_1_Instances:" + str(len(indx_IRCbot_and_black_hole_1)) + " ---> "+ str(round(len(indx_IRCbot_and_black_hole_1)/total_instances*100, 4)) + " %")
        print("Black_hole_2_Instances:" + str(len(indx_Black_hole_2)) + " ---> "+ str(round(len(indx_Black_hole_2)/total_instances*100, 4)) + " %")
        print("Black_hole_3_Instances:" + str(len(indx_Black_hole_3)) + " ---> "+ str(round(len(indx_Black_hole_3)/total_instances*100, 4)) + " %")
        print("TBot_1_Instances:" + str(len(indx_TBot_1)) + " ---> "+ str(round(len(indx_TBot_1)/total_instances*100, 4)) + " %")
        print("TBot_2_Instances:" + str(len(indx_TBot_2)) + " ---> "+ str(round(len(indx_TBot_2)/total_instances*100, 4)) + " %")
        print("TBot_3_Instances:" + str(len(indx_TBot_3)) + " ---> "+ str(round(len(indx_TBot_3)/total_instances*100, 4)) + " %")
        print("TBot_4_Instances:" + str(len(indx_TBot_4)) + " ---> "+ str(round(len(indx_TBot_4)/total_instances*100, 4)) + " %")
        print("Weasel_master_Instances:" + str(len(indx_Weasel_master)) + " ---> "+ str(round(len(indx_Weasel_master)/total_instances*100, 4)) + " %")
        print("Weasel_bot_Instances:" + str(len(indx_Weasel_bot)) + " ---> "+ str(round(len(indx_Weasel_bot)/total_instances*100, 4)) + " %")
        print("Zeus_1_Instances:" + str(len(indx_Zeus_1)) + " ---> "+ str(round(len(indx_Zeus_1)/total_instances*100, 4)) + " %")
        print("Zeus_2_Instances:" + str(len(indx_Zeus_2)) + " ---> "+ str(round(len(indx_Zeus_2)/total_instances*100, 4)) + " %")
        print("Zeus_3_Instances:" + str(len(indx_Zeus_3)) + " ---> "+ str(round(len(indx_Zeus_3)/total_instances*100, 4)) + " %")
        print("bin_Zeus_Instances:" + str(len(indx_bin_Zeus)) + " ---> "+ str(round(len(indx_bin_Zeus)/total_instances*100, 4)) + " %")
        print("Osx_trojan_Instances:" + str(len(indx_Osx_trojan)) + " ---> "+ str(round(len(indx_Osx_trojan)/total_instances*100, 4)) + " %")
        print("zero_access_1_Instances:" + str(len(indx_zero_access_1)) + " ---> "+ str(round(len(indx_zero_access_1)/total_instances*100, 4)) + " %")
        print("zero_access_2_Instances:" + str(len(indx_zero_access_2)) + " ---> "+ str(round(len(indx_zero_access_2)/total_instances*100, 4)) + " %")
        print("Smoke_bot_Instances:" + str(len(indx_Smoke_bot)) + " ---> "+ str(round(len(indx_Smoke_bot)/total_instances*100, 4)) + " %")


%%time
if begin_from_start:

    if botnet == True:

        # This cell labels the 'Label' column in the data frame to 1 where the particular botnet was found

        data.loc[:, 'Label'] = 0.0

        data.loc[indx_IRC_2, 'Label'] = 1
        data.loc[indx_IRC_3, 'Label'] = 1
        data.loc[indx_IRC_4, 'Label'] = 1
        data.loc[indx_IRC_5, 'Label'] = 1
        data.loc[indx_IRC_6, 'Label'] = 1
        data.loc[indx_IRC_7, 'Label'] = 1
        data.loc[indx_IRC_11, 'Label'] = 1
        data.loc[indx_IRC_15, 'Label'] = 1
        data.loc[indx_Neris, 'Label'] = 1

        data.loc[indx_RBot, 'Label'] = 1

        data.loc[indx_Virut, 'Label'] = 1

        data.loc[indx_Zeus_2, 'Label'] = 1


    #     print(data['Label'])

I would be grateful if someone could tell me whether I am labeling the flows correctly.
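One thing that may be worth checking in the code above is how `str.contains` matches. By default it treats the pattern as a regular expression, so each `.` matches any character, and it matches substrings, so a shorter address can also hit a longer one that merely starts with it. The sketch below illustrates this on made-up Flow IDs; it assumes the ID has the form `src-dst-srcport-dstport-protocol`, which is a guess based on the patterns used in the question, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up Flow IDs in an assumed "src-dst-srcport-dstport-protocol" layout.
flow_ids = pd.Series([
    "192.168.3.25-8.8.8.8-1025-53-17",   # the listed host as source
    "192.168.3.250-8.8.8.8-1025-53-17",  # a different host sharing the prefix
    "8.8.8.8-192.168.3.25-53-1025-17",   # the listed host as destination
])
target = "192.168.3.25"

# Substring search: also flags 192.168.3.250.
substring_hits = flow_ids.str.contains(target, regex=False)

# Exact comparison of the two address fields only.
endpoints = flow_ids.str.split("-", expand=True)[[0, 1]]
exact_hits = endpoints.eq(target).any(axis=1)

print(pd.DataFrame({
    "flow_id": flow_ids,
    "substring": substring_hits,
    "exact": exact_hits,
}))
```

If the flow table still has separate `Src IP` and `Dst IP` columns at that point, comparing those columns directly avoids parsing the Flow ID at all.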

Hello Rizwan, did you manage to find the answer to your question? I'm struggling with the same issue now. Also, how did you extract the network flows (those CSV files)? Their site provides PCAPs only. - WindyFields, Apr 12 '21 at 16:33
@WindyFields Sorry for the late response. I am currently working on this dataset. You can use Wireshark to convert the .pcap files to .csv files. - Manish, Jun 02 '22 at 12:24
