Unable to remove part of a string in pandas DataFrame

Question

I'm using the KDDCup to train a Neural Network, but I'm getting rather confused with the layout of the data. When I download the dataset using the scikit-learn dataset function:

data = datasets.fetch_kddcup99(return_X_y=True)
df = pd.DataFrame(np.column_stack((data[0], data[1])))

and then run the command df.head(), it returns the following information:

  0       1        2      3    4     5      ...     36 37 38 39 40          41
0  0  b'tcp'  b'http'  b'SF'  181  5450     ...      0  0  0  0  0  b'normal.'
1  0  b'tcp'  b'http'  b'SF'  239   486     ...      0  0  0  0  0  b'normal.'
2  0  b'tcp'  b'http'  b'SF'  235  1337     ...      0  0  0  0  0  b'normal.'
3  0  b'tcp'  b'http'  b'SF'  219  1337     ...      0  0  0  0  0  b'normal.'
4  0  b'tcp'  b'http'  b'SF'  217  2032     ...      0  0  0  0  0  b'normal.'

[5 rows x 42 columns]

I'm trying to change the output class (column 41) to a binary label depending on its value (0 if normal, else 1). This is proving difficult because the dtype is object, and whenever I use str.contains on it, it turns ALL samples (half a million of them) to NaN.

I thought a way around this would be to strip the b' prefix from the values, but I haven't been able to do this successfully.

I'm a bit stumped on how to manipulate this dataframe where all columns are type Object, even the scalar values.

Yeah, that b' seems to be in front of every string literal, which then turns the dtype to "Object". – Johnathan Brown, Sep 27 '18 at 09:53

Naga kiran · Accepted Answer · 2018-09-27T10:08:35.743


You can remove the b prefix by decoding the bytes values into strings.

A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

stri = "Response from server"
c.send(stri.encode())  # encode() turns str into bytes; c is presumably a socket defined elsewhere

df[41] = df[41].apply(lambda x: x.decode('utf-8'))  # decode bytes to str and assign back to keep the result
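
As a rough sketch of how decoding helps with the binary label the question asks for, the snippet below uses a tiny stand-in frame rather than the real dataset. The rows are made up for illustration, and the name `label` is hypothetical.

```python
import pandas as pd

# Tiny stand-in for the KDD frame; these rows are illustrative, not real data.
df = pd.DataFrame({1: [b'tcp', b'udp', b'tcp'],
                   41: [b'normal.', b'smurf.', b'normal.']})

# Decode the bytes values in the label column into ordinary strings.
df[41] = df[41].apply(lambda x: x.decode('utf-8'))

# 0 for normal traffic, 1 for anything else.
label = (df[41] != 'normal.').astype(int)
print(label.tolist())
```

Comparing against the exact string 'normal.' assumes the label keeps its trailing dot, as in the df.head() output shown in the question.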

edited Sep 27 '18 at 10:08

answered Sep 27 '18 at 09:59

Okay, I understand. I can run df[41].apply(lambda x: x.decode('utf-8')) and this correctly displays the data as normal. – Johnathan Brown Sep 27 '18 at 10:02
I was able to adapt your answer and remove all byte literals by first listing all the columns that contain them, then looping through those columns and applying the decode function. I then tested the outcome by retrying the `str.contains` function, which correctly turned my output column into 1s and 0s. Thanks for the help. – Johnathan Brown Sep 27 '18 at 12:19
Welcome :-) @JohnathanBrown – Naga kiran Sep 27 '18 at 13:24
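
The approach described in the comments above might look roughly like the following. This is only a sketch: the comment does not say how the byte columns were chosen, so the detection rule here (every value is a bytes instance) is an assumption, the name `byte_cols` is hypothetical, and the sample frame is made up.

```python
import pandas as pd

# Illustrative stand-in frame: numbers and bytes values, all stored as object dtype.
df = pd.DataFrame({0: [0, 0], 1: [b'tcp', b'udp'], 41: [b'normal.', b'neptune.']},
                  dtype=object)

# Assumed rule: treat a column as a bytes column if every value is a bytes instance.
byte_cols = [c for c in df.columns
             if df[c].map(lambda v: isinstance(v, bytes)).all()]

# Decode each bytes column into ordinary strings.
for c in byte_cols:
    df[c] = df[c].map(lambda v: v.decode('utf-8'))

# On real strings, str.contains returns booleans; invert so normal -> 0, else -> 1.
df[41] = (~df[41].str.contains('normal')).astype(int)
print(df[41].tolist())
```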
