How to derive KDD99 Features from DARPA pcap file?

Question

I have recently been working with the DARPA network traffic captures and the version derived from them that was used in KDD99 for intrusion detection evaluation.

Please excuse my limited domain knowledge of computer networks: I could only derive 9 features from the DARPA packet headers, not the 41 features used in KDD99.

I intend to continue my work on the UNB ISCX Intrusion Detection Evaluation DataSet. However, I want to derive the 41 features used in KDD99 from the pcap files and save them in CSV format. Is there a fast/easy way to achieve this?

score 9 · Accepted Answer · edited Jun 20 '20 at 09:12

Be careful with this data set.

http://www.kdnuggets.com/news/2007/n18/4i.html

Some excerpts:

the artificial data was generated using a closed network, some proprietary network traffic generators, and hand-injected attacks

Among the issues raised, the most important seemed to be that no validation was ever performed to show that the DARPA dataset actually looked like real network traffic.

In 2003, Mahoney and Chan built a trivial intrusion detection system and ran it against the DARPA tcpdump data. They found numerous irregularities, including that -- due to the way the data was generated -- all the malicious packets had a TTL of 126 or 253 whereas almost all the benign packets had a TTL of 127 or 254.

the DARPA dataset (and by extension, the KDD Cup '99 dataset) was fundamentally broken, and one could not draw any conclusions from any experiments run using them

we strongly recommend that (1) all researchers stop using the KDD Cup '99 dataset

As for the feature extraction used: IIRC, the majority of the features were simply attributes of the parsed IP/TCP/UDP headers, such as the port number, the last octet of the IP address, and some packet flags.

As such, these findings no longer reflect realistic attacks anyway. Today's TCP/IP stacks are much more robust than they were when the data set was created, when a "ping of death" would instantly lock up a Windows host. Every developer of a TCP/IP stack should by now be aware of the risk of such malformed packets and stress-test the stack against them.

With this, these features have become pretty much meaningless. Incorrectly set SYN flags and the like are no longer used in network attacks. Modern attacks are much more sophisticated and most likely no longer target the TCP/IP stack, but the services running on the next layer. So I would not bother finding out which low-level packet flags were used in that flawed '99 simulation, with attacks that worked in the early '90s...
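For anyone who still wants the header-level attributes mentioned above (for example, to compare against older results), here is a minimal sketch of pulling them out of a raw IPv4/TCP packet with Python's standard `struct` module. The function name `parse_ipv4_tcp`, the returned keys, and the sample packet are hypothetical; this is not the procedure used to build KDD99, and it assumes the link-layer header has already been stripped.

```python
import struct

# TCP flag bit masks in the low six bits of the flags field (FIN is the lowest bit).
TCP_FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_ipv4_tcp(packet: bytes) -> dict:
    """Extract a few header attributes from a raw IPv4 packet carrying TCP."""
    # Fixed 20-byte part of the IPv4 header, network byte order.
    (ver_ihl, _tos, _total_len, _ident, _frag, ttl, proto,
     _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    if ver_ihl >> 4 != 4 or proto != 6:
        raise ValueError("not an IPv4 packet carrying TCP")

    # The IPv4 header length is given in 32-bit words; TCP starts right after it.
    ip_header_len = (ver_ihl & 0x0F) * 4
    (src_port, dst_port, _seq, _ack, offset_flags,
     _window) = struct.unpack("!HHIIHH", packet[ip_header_len:ip_header_len + 16])

    flag_bits = offset_flags & 0x3F
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "src_last_octet": src[3],
        "dst_last_octet": dst[3],
        "ttl": ttl,
        "flags": [name for name, bit in TCP_FLAGS.items() if flag_bits & bit],
    }


# Hand-built sample packet (made-up addresses): a SYN to port 80 with TTL 127.
sample = (
    struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 127, 6, 0,
                bytes([172, 16, 112, 50]), bytes([192, 168, 1, 30]))
    + struct.pack("!HHIIHHHH", 1234, 80, 0, 0, (5 << 12) | 0x02, 8192, 0, 0)
)
features = parse_ipv4_tcp(sample)
```

Each returned dictionary could be written as one CSV row with `csv.DictWriter`. Reading the packets out of a pcap file is left out here, since it depends on the capture library you choose.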

(Realizing that feature extraction needs to be updated over time, however, is a valuable conclusion to draw from this data set. ;-) ) — Has QUIT--Anony-Mousse, Dec 30 '12 at 12:03
Thank you very much for your input. I am aware of the pitfalls of this dataset and I am planning to use the UNB ISCX Intrusion Detection Evaluation DataSet. However, I am more interested in visualizing the behaviour of the network and trying to answer (to some extent) the question "Can we distinguish anomalies related to intrusions from those related to other factors?". Therefore, I need to extract as much "meaningful" information from the network traffic as possible. Is there a tool that could help me achieve this? — amaatouq, Dec 31 '12 at 07:46
Well, they are not real anomalies, but simulated ones, and they would not look like this in today's networks anymore. But you can give the text export of Wireshark a try. Maybe it can be configured to list the TCP/IP header flags verbosely. Otherwise, you will have to look up the bit positions yourself. But again: they are no longer meaningful for today's networks. — Has QUIT--Anony-Mousse, Dec 31 '12 at 10:07
The UNB ISCX Intrusion Detection Evaluation DataSet (http://www.iscx.ca/dataset) is from 2012, and according to some researchers it is one of the very few that do reflect today's network traffic. I tried to use Wireshark, but the text export doesn't give you a lot of information. In your opinion, which features of the network traffic represent today's attacks? Or could you point me to a source? :) — amaatouq, Dec 31 '12 at 12:44
Today's attack patterns mostly require deep packet inspection, i.e. looking at the payload, not the raw packet headers. The top attack pattern is SQL injection. It does not show up in the TCP headers at all; it will look like legitimate traffic until you look at the actual HTTP request. The UNB data set seems to focus on DDoS and brute-force attacks, which will likely show up as temporally anomalous micro-clusters. But you won't need TCP SYN flags and such. — Has QUIT--Anony-Mousse, Dec 31 '12 at 13:42
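
To illustrate the kind of temporal feature that can expose DDoS or brute-force bursts, here is a rough sketch of a sliding-window count: for each connection, how many connections went to the same destination host in the preceding two seconds. KDD99 describes a time-based `count` feature along these lines. The function name, the input format (time-sorted `(timestamp, dst_host)` tuples), and the choice to include the current connection in the count are assumptions made for this example.

```python
from collections import defaultdict, deque


def same_host_counts(connections, window=2.0):
    """Count, per connection, the connections to the same host within `window` seconds.

    `connections` is an iterable of (timestamp, dst_host) tuples sorted by
    timestamp. The current connection is included in its own count
    (an assumption of this sketch).
    """
    recent = defaultdict(deque)  # dst_host -> timestamps still inside the window
    counts = []
    for ts, dst in connections:
        queue = recent[dst]
        # Drop timestamps that have fallen out of the window.
        while queue and ts - queue[0] > window:
            queue.popleft()
        queue.append(ts)
        counts.append(len(queue))
    return counts


# Made-up connection log: host "a" receives a small burst, then goes quiet.
log = [(0.0, "a"), (0.5, "a"), (1.0, "b"), (1.9, "a"), (4.5, "a")]
counts = same_host_counts(log)
```

A sudden run of high counts for one host is the sort of micro-cluster that a burst of connection attempts would produce, and it needs no TCP flag information.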
