
I know how to read binary files in Python using NumPy's np.fromfile() function. The problem is that when I do so, the array contains extremely large numbers, on the order of 10^100, along with scattered nan and inf values.

I need to apply machine learning algorithms to this dataset, and I cannot work with the data in this form. I cannot normalise it because of the nan values.

I've tried np.nan_to_num(), but that doesn't help: afterwards, my min and max values are 3e-38 and 3e+38 respectively, so I still could not normalize the data.

Is there any way to scale this data down? If not, how should I deal with this?

Thank you.

EDIT:

Some context: I'm working on a malware classification problem. My dataset consists of live malware binaries, which are files of type .exe, .apk, etc. My idea is to store each binary as a NumPy array, convert it to a grayscale image, and then perform pattern analysis on it.

Suyash Shetty

2 Answers


If you want to make an image out of a binary file, you need to read it in as integer, not float. Currently, the most common format for images is unsigned 8-bit integers.
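To see why a float read can produce huge values, nan and inf, the following sketch interprets the same raw bytes both ways. The byte values are made up for illustration and are not taken from any real file.

```python
import numpy as np

# Eight arbitrary bytes standing in for file contents (illustrative values only).
raw = bytes([0x00, 0x00, 0xC0, 0x7F, 0xFF, 0xFF, 0x7F, 0x7F])

# One value per byte: every element is guaranteed to lie in 0..255.
as_uint8 = np.frombuffer(raw, dtype=np.uint8)

# The same bytes read as two little-endian 32-bit floats:
# the first bit pattern is a NaN, the second is close to the float32 maximum.
as_float32 = np.frombuffer(raw, dtype="<f4")

print(as_uint8)
print(as_float32)
```

Arbitrary bytes are not meaningful floating-point numbers, so a float dtype turns them into whatever their bit patterns happen to encode.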

As an example, let's make an image out of the first 10,000 bytes of /bin/bash:

>>> import numpy as np
>>> import cv2
>>> xbash = np.fromfile('/bin/bash', dtype='uint8')
>>> xbash.shape
(1086744,)
>>> cv2.imwrite('bash1.png', xbash[:10000].reshape(100,100))

In the above, we used the OpenCV library to write the integers to a PNG file. Any of several other imaging libraries could have been used.

This is what the first 10,000 bytes of bash "look" like:

[Image: bash1.png, the first 10,000 bytes of /bin/bash as a 100×100 grayscale image]
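Real binaries rarely have a length that fits a neat rectangle, so one possible approach is to pick a fixed width and zero-pad the last row. The sketch below assumes this padding scheme; the helper name bytes_to_grayscale, the default width of 256, and the sample data are all hypothetical and not part of the original answer.

```python
import numpy as np

def bytes_to_grayscale(data, width=256):
    """Reshape raw bytes into a 2-D uint8 array, zero-padding the last row."""
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = -(-pixels.size // width)            # ceiling division
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:pixels.size] = pixels                # copy the bytes; the tail stays 0
    return padded.reshape(height, width)

# Stand-in for the contents of a binary file (hypothetical data, 771 bytes).
sample = bytes(range(256)) * 3 + b"\x01\x02\x03"
image = bytes_to_grayscale(sample, width=256)
print(image.shape, image.dtype)
```

The resulting array can be written out with cv2.imwrite in the same way as the reshaped slice in the example above.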

John1024
    This worked! I was reading the binary files as floats, which caused the error. I read it as uint8 and it worked fine. Thanks – Suyash Shetty Sep 29 '16 at 06:21

EDIT 2

The accepted answer to the question "Numpy integer nan" states:
NaN can't be stored in an integer array. A nan is a special value for float arrays only. There are talks about introducing a special bit that would allow non-float arrays to store what in practice would correspond to a nan, but so far (2012/10), it's only talks. In the meantime, you may want to consider the numpy.ma package: instead of picking an invalid integer like -99999, you could use the special numpy.ma.masked value to represent an invalid value.

>>> import numpy as np
>>> a = np.ma.array([1,2,3,4,5], dtype=int)
>>> a[1] = np.ma.masked
>>> a
masked_array(data = [1 -- 3 4 5],
             mask = [False  True False False False],
       fill_value = 999999)
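If the data really is floating point and contains nan or inf, masked arrays can also be used to normalise around the invalid entries. This is a minimal sketch with made-up values; it uses np.ma.masked_invalid, and min-max scaling is just one assumed choice of normalisation.

```python
import numpy as np

# Made-up float data containing invalid entries.
values = np.array([1.0, np.nan, 3.0, np.inf, 5.0])

# Mask every nan and inf so that reductions skip them.
masked = np.ma.masked_invalid(values)

# Min-max scaling computed from the valid entries only.
lo, hi = masked.min(), masked.max()
scaled = (masked - lo) / (hi - lo)

# Replace the masked slots with a placeholder before handing the data on.
filled = scaled.filled(0.0)
print(filled)
```

Whether 0.0 is a sensible fill value depends on the downstream algorithm; it is used here only as a placeholder.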

EDIT 1

To read a binary file:

  1. Read the binary file content like this:

    with open(fileName, mode='rb') as file: # b is important -> binary
        fileContent = file.read()
    

    After that, you can "unpack" the binary data using struct.unpack.

  2. If you are using the np.fromfile() function:

    numpy.fromfile can read data from both text and binary files. First construct a data type that represents your file format using numpy.dtype, then read that type from the file using numpy.fromfile.
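Both options can be sketched on a small file with a known layout. The record format here (a little-endian uint16 id followed by a float32 value) and the file name records.bin are hypothetical, chosen only so the example is self-contained; a real file needs a dtype or format string that matches its actual layout.

```python
import os
import struct
import tempfile

import numpy as np

# Hypothetical record layout: little-endian uint16 id, then float32 value.
record_dtype = np.dtype([("id", "<u2"), ("value", "<f4")])

# Write two records to a temporary file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<Hf", 1, 0.5))
    f.write(struct.pack("<Hf", 2, 1.5))

# Option 1: read the raw bytes, then unpack the first record with struct.
with open(path, "rb") as f:
    content = f.read()
first_id, first_value = struct.unpack_from("<Hf", content, 0)

# Option 2: let NumPy parse every record using the structured dtype.
records = np.fromfile(path, dtype=record_dtype)

print(first_id, first_value)
print(records["id"], records["value"])
```

The struct route suits small headers read field by field, while the structured dtype is convenient when the file is a long run of identical records.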

Sayali Sonawane