
I have a set of generated data describing web connections in CSV that looks like this:

conn_duration,conn_destination,response_size,response_code,is_malicious
1564,130,279,532,302,0
1024,200,627,1032,307,0
2940,130,456,3101,201,1

Full CSV here

The class indicates which connections are of interest, based on conn_duration, conn_destination and response_code.

I think LogisticRegression would be a good fit here, but the results I'm getting aren't great. The generated dataset has 750 rows with class 0 and 150 with class 1.

This is how I'm manipulating and providing the data:

import numpy
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Normalizer

names = ['conn_duration', 'conn_destination', 'response_size', 'response_code', 'is_malicious']
dataframe = pandas.read_csv(path, names=names)
array = dataframe.values

# separate array into input and output components
X = array[:, 0:4]
y = array[:, 4]

scaler = Normalizer().fit(X)
normalizedX = scaler.transform(X)

# summarize transformed data
numpy.set_printoptions(precision=3)
print(normalizedX[0:5, :])

model = LogisticRegression()
model.fit(normalizedX, y)  # fit on the normalized data, not the raw X

# Two test rows; expect the first to be predicted 1 and the second 0
Xnew = [[3492, 150, 750, 200], [3492, 120, 901, 200]]

for conn in Xnew:
    # normalize each new sample the same way as the training data, then predict
    ynew = model.predict(scaler.transform([conn]))
    print("X=%s, Predicted=%s" % (conn, ynew[0]))

The criterion for a malicious piece of traffic is that the response code is 200, conn_destination is 150, and the response size is greater than 500.

I'm getting reasonable predictions, but I wonder whether LogisticRegression is the right algorithm to be using?

TIA!

James MV

3 Answers


If the code is working but you aren't sure which algorithm to use, I would recommend trying an SVM, a random forest, etc. Use GridSearchCV to tune each candidate's hyperparameters and cross-validation to determine which gives the best performance.
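A minimal sketch of that comparison, using a pipeline so each candidate is scaled the same way. The models, parameter grids, and synthetic data here are illustrative assumptions, not tuned values; swap in your own X and y:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the questioner's data: replace with the real X and y
X, y = make_classification(n_samples=900, n_features=4, weights=[750 / 900],
                           random_state=0)

candidates = {
    'logreg': (LogisticRegression(max_iter=1000), {'clf__C': [0.1, 1, 10]}),
    'svm':    (SVC(),                             {'clf__C': [0.1, 1, 10]}),
    'forest': (RandomForestClassifier(random_state=0),
               {'clf__n_estimators': [50, 100]}),
}

for name, (clf, grid) in candidates.items():
    pipe = Pipeline([('scale', StandardScaler()), ('clf', clf)])
    # F1 is a more informative score than accuracy given the class imbalance
    search = GridSearchCV(pipe, grid, cv=5, scoring='f1')
    search.fit(X, y)
    print(name, search.best_score_, search.best_params_)
```

Whichever model wins here, the cross-validated score is a fairer comparison than a single train/test split.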

whenitrains

Since there's a simple rule to classify the traffic, namely "response code is 200, conn_destination is 150, and the response size is greater than 500", you don't actually need a model to solve it. Don't over-engineer a simple problem.

For study purposes it's fine, but the model should get very close to 100% accuracy, because it should learn this rule.
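As a sketch, the rule from the question can be written directly, with no model at all (argument order follows the question's CSV header):

```python
def is_malicious(conn_duration, conn_destination, response_size, response_code):
    # The stated rule: response code 200, destination 150, size > 500
    return response_code == 200 and conn_destination == 150 and response_size > 500

print(is_malicious(3492, 150, 750, 200))   # questioner's first test row -> True
print(is_malicious(3492, 120, 901, 200))   # second test row -> False
```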

In any case, conn_destination and response_code are categorical data. If you normalize them directly, the algorithm will treat 200 as closer to 201 than to 300, but they are categories, not numbers.

Here's a reference on some ways to treat categorical data: Linear regression analysis with string/categorical features (variables)?
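A minimal sketch of one-hot encoding those two columns with pandas (column names come from the question; the toy values are made up):

```python
import pandas as pd

# Toy frame with the question's feature columns (values are made up)
df = pd.DataFrame({
    'conn_duration':    [1564, 1024, 2940],
    'conn_destination': [130, 200, 130],
    'response_size':    [279, 627, 456],
    'response_code':    [302, 307, 201],
})

# One-hot encode the categorical columns so the model sees no false ordering
encoded = pd.get_dummies(df, columns=['conn_destination', 'response_code'])
print(encoded.columns.tolist())
```

Each distinct destination and response code becomes its own 0/1 column, so "200 vs 201" and "200 vs 300" are equally far apart, as they should be.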


I would try XGBoost (Extreme Gradient Boosted Trees). On large datasets SVMs are computationally costly, and I especially like random forests when you have a highly imbalanced dataset.

Logistic regression can be part of a neural network, if you want to develop something more accurate and sophisticated, e.g. tuning hyperparameters, avoiding overfitting and improving generalization. You can also do that in XGBoost, by pruning trees.

XGBoost and neural networks would be my choices for a classification problem, but the whole thing is bigger than that. It's not about choosing an algorithm, but about understanding how it works, what is going on under the hood, and how you can adjust it so that you can accurately predict classes.

Also, data preparation, variable selection, outlier detection and descriptive statistics are very important for the quality and accuracy of your model.
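A hedged sketch of compensating for the 750/150 class imbalance with a class-weighted random forest; XGBoost's equivalent knob is `scale_pos_weight`. The data here is synthetic, standing in for the questioner's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 750-negative / 150-positive dataset
X, y = make_classification(n_samples=900, n_features=4, weights=[750 / 900],
                           random_state=0)

# class_weight='balanced' re-weights samples inversely to class frequency,
# so the minority (malicious) class is not drowned out during training
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring='f1')
print("mean F1: %.3f" % scores.mean())
```

Scoring with F1 (or precision/recall) rather than accuracy is important here: always predicting 0 already gets ~83% accuracy on this split.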

razimbres