I'm using the KDDCup99 dataset to train a neural network, but I'm confused by the layout of the data. When I download the dataset using the scikit-learn dataset function:
data = datasets.fetch_kddcup99(return_X_y=True)
df = pd.DataFrame(np.column_stack((data[0], data[1])))
and then run df.head(), it returns the following:
0 1 2 3 4 5 ... 36 37 38 39 40 41
0 0 b'tcp' b'http' b'SF' 181 5450 ... 0 0 0 0 0 b'normal.'
1 0 b'tcp' b'http' b'SF' 239 486 ... 0 0 0 0 0 b'normal.'
2 0 b'tcp' b'http' b'SF' 235 1337 ... 0 0 0 0 0 b'normal.'
3 0 b'tcp' b'http' b'SF' 219 1337 ... 0 0 0 0 0 b'normal.'
4 0 b'tcp' b'http' b'SF' 217 2032 ... 0 0 0 0 0 b'normal.'
[5 rows x 42 columns]
I'm trying to make the output class (column 41) binary, depending on the label: 0 if normal, 1 otherwise. This is proving difficult because the dtype is object, and whenever I call df.str.contains, it turns ALL samples (half a million of them) into NaN.
I thought a way around this would be to replace the b' prefix with nothing, but I haven't been able to do this successfully.
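To make the goal concrete, here is a minimal sketch of the binary mapping I'm after, run on a small toy frame rather than the real download. It assumes the b'...' values are Python bytes objects (which the prefix suggests), so the label is compared against a bytes literal instead of going through the .str accessor. The column name is_attack is just a name I made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real frame: byte-string values in object columns,
# mirroring columns 1 and 41 of the df.head() output above.
df = pd.DataFrame({
    1: [b"tcp", b"tcp", b"udp"],
    41: [b"normal.", b"smurf.", b"normal."],
})

# Compare against a bytes literal (note the b prefix and the trailing dot):
# 0 where the label is b"normal.", 1 for anything else.
df["is_attack"] = (df[41] != b"normal.").astype(int)

print(df)
```

If this is right, the b' prefix would not need to be removed at all, since it is only how bytes values are displayed.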
I'm a bit stumped on how to manipulate this dataframe where all columns are type object, even the scalar values.
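As a rough sketch of the kind of cleanup I have in mind for the object columns, again on a toy frame and not the real data: decode the columns that hold bytes into text, and pass the rest through pd.to_numeric. It assumes each column is homogeneous, so checking the first value is enough to decide which branch to take.

```python
import pandas as pd

# Toy frame with every column stored as object, like the real one.
df = pd.DataFrame(
    {
        0: [0, 0, 0],
        1: [b"tcp", b"tcp", b"udp"],
        4: [181, 239, 235],
    },
    dtype=object,
)

for col in df.columns:
    if isinstance(df[col].iloc[0], bytes):
        # Bytes column: decode to ordinary text strings.
        df[col] = df[col].str.decode("utf-8")
    else:
        # Everything else: let pandas infer a numeric dtype.
        df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```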