I have a set of generated data describing web connections in CSV that looks like this:
conn_duration,conn_destination,response_size,response_code,is_malicious
1564,130,279,532,302,0
1024,200,627,1032,307,0
2940,130,456,3101,201,1
Full CSV here
The class (is_malicious) indicates which connections are of interest, based on duration, destination and response code.
I think LogisticRegression would be a good fit here, but the results I'm getting aren't great. The generated dataset has 750 rows with class 0 and 150 rows with class 1.
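To put that class split in context, here is a small worked calculation (only the 750/150 counts come from my data; the "always predict 0" model is a hypothetical baseline for comparison). It suggests plain accuracy may flatter a model on this dataset, so recall on class 1 is probably the number to watch:

```python
# Class counts taken from the dataset described above.
n_benign, n_malicious = 750, 150
total = n_benign + n_malicious

# Accuracy of a trivial (hypothetical) model that always predicts class 0.
baseline_accuracy = n_benign / total

# Recall on class 1 for the same trivial model: it never flags anything,
# so it finds 0 of the 150 malicious rows.
baseline_recall = 0 / n_malicious

print("baseline accuracy: %.3f" % baseline_accuracy)          # 0.833
print("baseline recall on class 1: %.3f" % baseline_recall)   # 0.000
```

So any accuracy figure at or below roughly 83% is no better than ignoring the malicious class entirely.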
This is how I'm manipulating and providing the data:
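import numpy
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Normalizer

# path (defined elsewhere) points at the CSV file
names = ['conn_duration', 'conn_destination', 'response_size', 'response_code', 'is_malicious']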
dataframe = pandas.read_csv(path, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:4]
y = array[:,4]
scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5,:])
model = LogisticRegression()
# Note: the model is trained on the raw X; normalizedX above is only printed
model.fit(X, y)
# Two test bits of data, expect the first to be predicted 1 and the second to be 0
Xnew = [[[3492, 150, 750, 200]], [[3492, 120, 901, 200]]]
for conn in Xnew:
    # make a prediction
    ynew = model.predict(conn)
    print("X=%s, Predicted=%s" % (conn[0], ynew[0]))
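One thing I noticed in the code above: the Normalizer is fitted and its output printed, but the model itself is trained on the raw X. A minimal sketch of one way to keep scaling and model together is a scikit-learn pipeline. Everything here is an assumption for illustration: the data is a randomly generated stand-in labelled with the rule from this post (not my real CSV), and StandardScaler (per-feature scaling) is swapped in for Normalizer, which rescales each row to unit norm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the real CSV: random rows labelled with the rule
# stated in this post (code 200, destination 150, size > 500).
rng = np.random.default_rng(0)
n = 900
X_demo = np.column_stack([
    rng.integers(100, 4000, n),            # conn_duration
    rng.choice([120, 130, 150, 200], n),   # conn_destination
    rng.integers(100, 1500, n),            # response_size
    rng.choice([200, 201, 302, 307], n),   # response_code
]).astype(float)
y_demo = (
    (X_demo[:, 3] == 200) & (X_demo[:, 1] == 150) & (X_demo[:, 2] > 500)
).astype(int)

# The scaler is fitted inside the pipeline, so fit() and predict()
# always apply the same scaling to their inputs.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_demo, y_demo)

# Same two test connections as above, as one 2D array.
X_new = np.array([[3492, 150, 750, 200], [3492, 120, 901, 200]], dtype=float)
predictions = pipeline.predict(X_new)
for row, label in zip(X_new, predictions):
    print("X=%s, Predicted=%s" % (row, label))
```

I have not included expected output, since what the model predicts on stand-in data says nothing about the real dataset; the point is only that training and prediction go through the same preprocessing.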
The criteria for malicious traffic are that the response code is 200, conn_destination is 150, and the response size is greater than 500.
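Written out as code, the rule looks roughly like this (the function name is_malicious_rule is just a hypothetical helper for illustration, not something in my script). It gives the expected labels for the two test connections, which is what I compare the model's predictions against:

```python
def is_malicious_rule(conn_duration, conn_destination, response_size, response_code):
    """Label a connection using the criteria stated in this post."""
    # conn_duration is accepted to match the CSV column order,
    # but it is not part of the rule.
    return int(
        response_code == 200
        and conn_destination == 150
        and response_size > 500
    )


# The two test connections from the prediction code.
test_rows = [[3492, 150, 750, 200], [3492, 120, 901, 200]]
labels = [is_malicious_rule(*row) for row in test_rows]
print(labels)  # [1, 0]
```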
I'm getting reasonable predictions, but I wonder whether LogisticRegression is the right algorithm to be using.
TIA!