
I have a dataset of 100,000+ .IMG files that I need to convert to .PNG / .JPG format so that I can apply a CNN to a simple classification task.
I referred to this answer, and the solution partially works for me: some images are not converted properly. As far as I understand, the reason is that some images have a pixel depth of 16 bits while others have 8.

for file in fileList:
    rawData = open(file, 'rb').read()
    size = re.search("(LINES              = \d\d\d\d)|(LINES              = \d\d\d)", str(rawData))
    pixelDepth = re.search("(SAMPLE_BITS        = \d\d)|(SAMPLE_BITS        = \d)", str(rawData))
    size = (str(size)[-6:-2])
    pixelDepth = (str(pixelDepth)[-4:-2])
    print(int(size))
    print(int(pixelDepth))
    imgSize = (int(size), int(size))



    img = Image.frombytes('L', imgSize, rawData)
    img.save(str(file)+'.jpg')


Data Source: NASA Messenger Mission
.IMG files and their corresponding converted .JPG Files


Files with Pixel Depth of 8 are successfully converted:


Files with Pixel Depth of 16 are NOT properly converted:

Please let me know if there's any more information that I should provide.

Harshit Jindal
  • Also, the top edge part of all images doesn't seem right. Could it be because of PIL trying to convert metadata at the top of the file into image? – Harshit Jindal Mar 08 '20 at 07:39
  • Do you only want to change the Image extension? – jizhihaoSAMA Mar 08 '20 at 08:07
  • @jizhihaoSAMA I want to convert the image – Harshit Jindal Mar 08 '20 at 08:27
  • So just `.save()` can change `.img` to `.png` or `.jpg` – jizhihaoSAMA Mar 08 '20 at 09:36
  • I don't believe your posted code can possibly work for any of your images. Please clarify if you have used it successfully at all. – Mark Setchell Mar 08 '20 at 10:19
  • Also, your second "IMG" file appears to link to a file called "maps.html"? – Mark Setchell Mar 08 '20 at 10:21
  • The pixels in the file EW0220137564B.IMG have a mean value of 252 and a standard deviation of just 3, so there is not anything very interesting in that image. If you provide some more IMG samples I'll take a closer look. – Mark Setchell Mar 08 '20 at 10:35
  • "Could it be because of PIL trying to convert metadata at the top of the file into image?" It converts what you feed it, so that is By Design. Did you think of downsampling to 8 bits? – Jongware Mar 08 '20 at 12:04
  • @MarkSetchell Yes this provided code DOES work correctly for all images other than those with 16 bit depth (ie: works for 512x512, 1024x1024 images with bit depth of 8). If you want, I can provide the complete code so that you could try it out – Harshit Jindal Mar 08 '20 at 13:51
  • @usr2564301 Thank you, i'll try reading the data starting from the end of the header. No, I didn't try to downsample it, how would that help? I'm new to image processing – Harshit Jindal Mar 08 '20 at 13:52
  • @MarkSetchell I don't understand what you mean by the second IMG file linked to "maps.html". Where did you see this? – Harshit Jindal Mar 08 '20 at 13:55
  • @jizhihaoSAMA That wouldn't work. It'll only change the file extension, that's not what I want – Harshit Jindal Mar 08 '20 at 13:55
  • @MarkSetchell The point that you made about the mean and SD, that makes sense. Thank you for pointing that out. All the data is picked up from NASA MESSENGER spacecraft, and there are some IMG files in there which do not contain any relevant information (ie: completely black, or just streaks of black and white lines) – Harshit Jindal Mar 08 '20 at 14:00
  • @MarkSetchell Here is the data source: https://pdsimage2.wr.usgs.gov/archive/mess-e_v_h-mdis-2-edr-rawdata-v1.0/MSGRMDS_1001/DATA/ I have also included the link in the description. It has all the IMG files that I have in my dataset – Harshit Jindal Mar 08 '20 at 14:03
  • Please post the code that works for 8-bit images, because the code you show will certainly not work. – Mark Setchell Mar 08 '20 at 21:53

1 Answer


Hopefully, from my other answer, here, you now have a better understanding of how your files are formatted. So, the code should look something like this:

#!/usr/bin/env python3

import sys
import re
import numpy as np
from PIL import Image
import cv2

# Read the whole file; the with-block closes it again
with open('EW0220137564B.IMG', 'rb') as f:
    rawData = f.read()
# File size in bytes
fs       = len(rawData)
# Search the text label directly as bytes (raw patterns avoid invalid-escape warnings)
bitDepth = int(re.search(rb"SAMPLE_BITS\s+=\s+(\d+)", rawData).group(1))
bytespp  = bitDepth // 8
height   = int(re.search(rb"LINES\s+=\s+(\d+)", rawData).group(1))
width    = int(re.search(rb"LINE_SAMPLES\s+=\s+(\d+)", rawData).group(1))
print(bitDepth,height,width)

# Offset from start of file to image data - assumes image at tail end of file
offset = fs - (width*height*bytespp)

# Check bitDepth
if bitDepth == 8:
    na = np.frombuffer(rawData, offset=offset, dtype=np.uint8).reshape(height,width)
elif bitDepth == 16:
    dt = np.dtype(np.uint16)
    dt = dt.newbyteorder('>')
    # Note: astype(np.uint8) keeps only the low 8 bits of each 16-bit sample
    na = np.frombuffer(rawData, offset=offset, dtype=dt).reshape(height,width).astype(np.uint8)
else:
    # Stop here: without an array there is nothing to save
    sys.exit(f'ERROR: Unexpected bit depth: {bitDepth}')

# Save either with PIL
Image.fromarray(na).save('result.jpg')
# Or with OpenCV may be faster
cv2.imwrite('result.jpg', na)
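
One caveat about the 16-bit branch above: astype(np.uint8) keeps only the low byte of each sample, so any value above 255 wraps around. If that turns out to matter for your images, a linear min-max rescale is one possible alternative. The sketch below is only an illustration: to_uint8 is a hypothetical helper name, and the sample values are synthetic rather than real MESSENGER data.

```python
import numpy as np


def to_uint8(samples):
    """Linearly rescale an integer array so its values span 0..255."""
    samples = samples.astype(np.float64)
    lo, hi = samples.min(), samples.max()
    if hi == lo:
        # Flat image: avoid dividing by zero and return all zeros
        return np.zeros(samples.shape, dtype=np.uint8)
    return np.round((samples - lo) * 255.0 / (hi - lo)).astype(np.uint8)


# Synthetic big-endian 16-bit samples (assumed values, for illustration only)
demo = np.array([[0, 256], [1024, 4095]], dtype='>u2')

wrapped = demo.astype(np.uint8)  # low byte only: 256 and 1024 both become 0
scaled = to_uint8(demo)          # full range mapped onto 0..255

print(wrapped)
print(scaled)
```

Whether rescaling per image is appropriate depends on what the classifier should see; a fixed divisor would preserve relative brightness between images instead.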

If you have thousands to do, I would recommend GNU Parallel, which you can easily install on your Mac with Homebrew using:

brew install parallel

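You can then change my program above to accept a filename as a parameter in place of the hard-coded filename. The command to get them all done in parallel is shown below. Note that --dry-run only prints the commands that would be run, so remove it once the output looks right: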

parallel --dry-run script.py {} ::: *.IMG

For a bit more effort, you can get it done even faster by putting the code above in a function and calling the function for each file specified as a parameter. That way you can avoid starting a new Python interpreter per image and tell GNU Parallel to pass as many files as possible to each invocation of your script like this:

parallel -X --dry-run script.py ::: *.IMG

The structure of the script then looks like this:

import sys

def processOne(filename):
    # Open, read, search, extract and save, as per my code above
    ...

# Main - process all filenames received as parameters
for filename in sys.argv[1:]:
    processOne(filename)
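
Filled in, that structure might look something like the sketch below. It makes the same assumptions as the code above: the label contains SAMPLE_BITS, LINES and LINE_SAMPLES, the image sits at the tail end of the file, and 16-bit samples are big-endian. The names label_int, load_pixels and process_one are illustrative, and writing the output as the input name plus ".png" is just a choice made for this sketch.

```python
#!/usr/bin/env python3
import re
import sys

import numpy as np
from PIL import Image


def label_int(raw, keyword):
    """Return the integer value of 'KEYWORD = n' found in the text label."""
    match = re.search(keyword + rb"\s+=\s+(\d+)", raw)
    if match is None:
        raise ValueError(f"{keyword.decode()} not found in label")
    return int(match.group(1))


def load_pixels(filename):
    """Read one .IMG file and return its pixels as a 2-D uint8 array."""
    with open(filename, "rb") as f:
        raw = f.read()
    bit_depth = label_int(raw, b"SAMPLE_BITS")
    height = label_int(raw, b"LINES")
    width = label_int(raw, b"LINE_SAMPLES")
    if bit_depth not in (8, 16):
        raise ValueError(f"unexpected bit depth: {bit_depth}")
    # Assumption carried over from the code above: image data is at the tail end
    offset = len(raw) - width * height * (bit_depth // 8)
    if offset < 0:
        raise ValueError("file is smaller than the image size in its label")
    dtype = np.dtype(np.uint8) if bit_depth == 8 else np.dtype(">u2")
    pixels = np.frombuffer(raw, dtype=dtype, offset=offset).reshape(height, width)
    # Same conversion as above: 16-bit samples keep only their low 8 bits
    return pixels.astype(np.uint8)


def process_one(filename):
    """Convert one .IMG file and save it next to the original as PNG."""
    out_name = filename + ".png"
    Image.fromarray(load_pixels(filename)).save(out_name)
    return out_name


def main(filenames):
    for filename in filenames:
        try:
            process_one(filename)
        except (OSError, ValueError) as err:
            # Report and carry on, so one bad file does not stop the batch
            print(f"ERROR: {filename}: {err}", file=sys.stderr)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Catching the error per file means a single unreadable or oddly labelled file is reported on stderr while the rest of the batch continues.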
Mark Setchell
  • Thank you so much for taking out the time to write this answer. I've understood it and it is working for me. – Harshit Jindal Mar 09 '20 at 09:45
  • Excellent, you're welcome. Good luck with your project and remember, questions (and answers) are free, so come back if you get stuck. – Mark Setchell Mar 09 '20 at 09:58