
I'm working with the ISCXVPN2016 dataset. It consists of pcap files (each pcap is captured traffic of a specific app such as Skype, YouTube, etc.), which I have converted to pickle files. I then wrote one into a text file using the code below:

with open("AIMchat2.pcapng.pickle", "rb") as pkl, open("file.txt", "w") as f:
    for item in pkl:  # iterates over b'\n'-separated chunks of the raw pickle bytes
        f.write('%s\n' % item)  # writes the repr of each bytes chunk

file.txt:

b'\x80\x03]q\x00(cnumpy.core.multiarray\n' b'_reconstruct\n' b'q\x01cnumpy\n' b'ndarray\n' b'q\x02K\x00\x85q\x03C\x01bq\x04\x87q\x05Rq\x06(K\x01K\x9d\x85q\x07cnumpy\n' b'dtype\n' b'q\x08X\x02\x00\x00\x00u1q\tK\x00K\x01\x87q\n' b'Rq\x0b(K\x03X\x01\x00\x00\x00|q\x0cNNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\rb\x89C\x9dE\x00\x00\x9dU\xbc@\x00\x80\x06\xd7\xc9\x83\xca\xf0W@\x0c\x18\xa74I\x01\xbb\t].\xc8\xf3*\xc51P\x18\xfa[)j\x00\x00\x17\x03\x02\x00p\x14\x90\xccY|\xa3\x7f\xd1\x12\xe2\xb4.U9)\xf20\xf1{\xbd\x1d\xa3W\x0c\x19\xc2\xf0\x8c\x0b\x8c\x86\x16\x99\xd8:\x19\xb0G\xe7\xb2\xf4\x9d\x82\x8e&a\x04\xf2\xa2\x8e\xce\xa4b\xcc\xfb\xe4\xd0\xde\x89eUU]\x1e\xfeF\x9bv\x88\xf4\xf3\xdc\x8f\xde\xa6Kk1q`\x94]\x13\xd7|\xa3\x16\xce\xcc\x1b\xa7\x10\xc5\xbd\x00\xe8M\x8b\x05v\x95\xa3\x8c\xd0\x83\xc1\xf1\x12\xee\x9f\xefmq\x0etq\x0fbh\x01h\x02K\x00\x85q\x10h\x04\x87q\x11Rq\x12(K\x01K.\x85q\x13h\x0b\x89C.E\x00\x00

My question is: how can I compute the entropy of each pickle file?

(I have updated the question)

Nebula

  • please define entropy – Marat Jan 04 '20 at 05:42
  • What about [How to calculate the entropy of a file?](https://stackoverflow.com/questions/990477/how-to-calculate-the-entropy-of-a-file) – ventaquil Jan 04 '20 at 06:31
  • If you need a rigorous process and an exact value, please comment. – noobmaster69 Jan 04 '20 at 07:10
  • @Marat Entropy is a measure of the randomness of data. If you mean which kind of entropy: there are several ways; for now I can simply use Shannon entropy. – Nebula Jan 04 '20 at 07:13
  • @ventaquil Actually I saw that, but couldn't write the Python code; I'm kinda new to Python. – Nebula Jan 04 '20 at 07:14
  • There's also this code in Python 2, but I encountered some errors I couldn't solve: [Calculate entropy of a file](https://stackoverflow.com/questions/18962990/calculate-entropy-of-a-file) fails with `TypeError: ord() expected string of length 1, but int found`. – Nebula Jan 04 '20 at 07:20

3 Answers


A naive solution is to gzip the file and use (size of gzipped file) / (size of original file) as a measure of randomness: the closer the ratio is to 1, the less compressible, and hence the more random, the data.
The result isn't exact, since gzip is not an "ideal" compressor, but it becomes more accurate as the file size grows.
A ready-made Python recipe for computing Shannon entropy is here:
http://code.activestate.com/recipes/577476-shannon-entropy-calculation/#c3
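
A minimal sketch of that compression-ratio heuristic, assuming Python 3 and using the filename from the question:

import gzip

with open("AIMchat2.pcapng.pickle", "rb") as f:
    data = f.read()

# Ratio close to 1.0 means the data is barely compressible, i.e. near-random.
print(len(gzip.compress(data)) / len(data))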

noobmaster69

  • Thanks for your solution. I think I read about this before, but I kinda need the precise number because I need to feed it to a deep neural network to see whether the results improve in comparison to other works. – Nebula Jan 04 '20 at 07:16
  • Ok I just saw the link. When I run this code, I encounter this error: TypeError: ord() expected string of length 1, but int found. I couldn't solve it. – Nebula Jan 04 '20 at 07:23
  • It was Python 2, by the way. For sanity checking, which Python version are you using? – noobmaster69 Jan 04 '20 at 07:25
  • Yes exactly, I'm using 3.7. I've searched for the error but couldn't find a proper solution. – Nebula Jan 04 '20 at 07:27
  • It was written for Python 2. Please try executing it with Python 2.7, not Python 3.x. – noobmaster69 Jan 04 '20 at 07:29
  • Tried using Python 2.7, as it is the lowest version I can choose in my Conda, but I still encounter an error. – Nebula Jan 04 '20 at 08:32
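
For what it's worth, the TypeError discussed above is a Python 2/3 difference rather than a flaw in the recipe's logic: in Python 3, iterating over a bytes object yields ints, not one-character strings, so the recipe's ord() call fails. A small illustration:

data = b"AB"

print(type(data[0]))  # <class 'int'> on Python 3 (a one-character str on Python 2)
print(data[0])        # 65 - already the byte value, so ord() is unnecessary
# ord(data[0])        # TypeError: ord() expected string of length 1, but int found

Dropping the ord() call is usually enough to port such recipes to Python 3.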

If I'm not doing anything wrong, this is the answer (based on [How to calculate the entropy of a file?](https://stackoverflow.com/questions/990477/how-to-calculate-the-entropy-of-a-file) and Shannon entropy).
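
For reference, the quantity computed below is the byte-level Shannon entropy in bits per byte, where p(b) is the relative frequency of byte value b in the file:

H = -∑ p(b) · log₂ p(b), summed over b = 0 … 255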

#!/usr/bin/env python3

import math


filename = "random_data.bin"

with open(filename, "rb") as file:
    counters = {byte: 0 for byte in range(2 ** 8)}  # start all counters with zeros

    for byte in file.read():  # file.read() loads the whole file; for very large files, read in chunks instead
        counters[byte] += 1  # increase counter for specified byte

    filesize = file.tell()  # we can get file size by reading current position

    probabilities = [counter / filesize for counter in counters.values()]  # calculate probabilities for each byte

    entropy = -sum(probability * math.log2(probability) for probability in probabilities if probability > 0)  # final sum

    print(entropy)

Checked against the `ent` program on Ubuntu 18.04 with Python 3.6.9:

$ dd if=/dev/urandom of=random_data.bin bs=1K count=16
16+0 records in
16+0 records out
16384 bytes (16 kB, 16 KiB) copied, 0.0012111 s, 13.5 MB/s
$ ent random_data.bin
Entropy = 7.988752 bits per byte.
...
$ ./calc_entropy.py
7.988751920202076

Tested with a text file too:

$ ent calc_entropy.py
Entropy = 4.613356 bits per byte.
...
$ ./calc_entropy.py
4.613355601248316
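
If the pickle files are very large, the same computation can be done without loading the whole file into memory. A minimal sketch of a chunked variant (the 64 KiB chunk size is an arbitrary choice):

import math


def entropy_chunked(filename, chunk_size=64 * 1024):
    counters = [0] * 256  # one counter per possible byte value
    size = 0

    with open(filename, "rb") as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:  # EOF
                break
            size += len(chunk)
            for byte in chunk:
                counters[byte] += 1

    # skip zero counts: 0 * log2(0) is taken as 0 in the Shannon entropy sum
    return -sum((count / size) * math.log2(count / size)
                for count in counters if count)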
ventaquil

  • Thanks, this gives some numbers for each file; I should just verify the correctness of the numbers. – Nebula Jan 04 '20 at 08:39
  • @Nebula I tested it against the publicly available program `ent`, so there is a good chance that everything works correctly. If you find any bug, feel free to point it out here. – ventaquil Jan 04 '20 at 08:44
  • Would you please tell me how I can test my file with ent? I have installed it, but when I enter `ent entropy.py` it says: `Cannot open file entropy.py` – Nebula Jan 05 '20 at 07:51
  • @Nebula from `man` we know the usage is `ent [options] [file]` - are you sure that the file `entropy.py` exists? – ventaquil Jan 05 '20 at 08:08
  • The file `entropy.py` is on the desktop, and when I enter `dpkg -L ent` the final line is `/usr/share/man/man1/ent.1.gz`. – Nebula Jan 05 '20 at 09:14
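
In case it helps: `Cannot open file` from `ent` most likely just means the file is not in the current working directory. Assuming the script really is on the desktop, something like this should work:

$ cd ~/Desktop
$ ent entropy.py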

You could use BiEntropy, TriEntropy, or their combination TriBiEntropy to compute the entropy of your pickle files. The algorithms are described on www.arxiv.org, and BiEntropy has been implemented with test harnesses on GitHub. BiEntropy has been tested positively on large raw binary files.
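
As I understand the arXiv paper, BiEntropy is a weighted average of the Shannon entropies of a binary string and its successive binary derivatives (each derivative being the XOR of adjacent bits). A rough sketch for short bit strings, reconstructed from the paper's definition - treat the GitHub implementation as authoritative:

import math


def _bit_entropy(p):
    # Shannon entropy of a single bit with P(1) = p
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bientropy(bits):
    # bits: a short list of 0/1 ints
    n = len(bits)
    s = list(bits)
    total = 0.0
    for k in range(n - 1):  # derivatives 0 .. n-2, weighted by 2**k
        p = sum(s) / len(s)  # proportion of ones in the k-th derivative
        total += _bit_entropy(p) * 2 ** k
        s = [a ^ b for a, b in zip(s, s[1:])]  # next binary derivative
    return total / (2 ** (n - 1) - 1)


print(bientropy([0, 1, 0, 1, 0, 1, 0, 1]))  # periodic string -> low BiEntropy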

  • Thank you. I just found your paper, and some repos on GitHub. These pickles are pcaps of encrypted traffic, so I think the entropy of these files will be high and the apps couldn't be distinguished from one another - am I right? – Nebula Jan 08 '20 at 18:43
  • Yep, BiEntropy of encrypted files will be high. If the encryption is weak, you may even be able to discriminate content. TriBiEntropy has not yet been tested (so far as I am aware!) on encrypted traffic, so there may be some surprises to be enjoyed. If you check the references you will see a paper on traffic classification. – Grenville Croll Jan 12 '20 at 23:42