Suppose I have a file test.pdf and I rename it so that it is now test.jpeg. The actual file format is still PDF: the file signature for a PDF is 25 50 44 46 2D, whereas a JPEG starts with FF D8 FF DB or FF D8 FF EE (among a few other variants).

I tried a few suggestions from "How can I check the extension of a file?", but they all appear to just report the apparent file extension taken from the filename. For example:

>>> import os
>>> file_name, file_extension = os.path.splitext("/Users/mark/Desktop/test.jpeg")
>>> file_extension
'.jpeg'

As shown, the file extension returned is .jpeg, but the real file extension is actually still .pdf.

Mark
  • The "real" file extension of a file called `test.jpeg` is `.jpeg`. Whatever the file's actual contents, the *file extension* is just the last part of the filename. – khelwood Feb 06 '20 at 22:52
  • You need to open the file and try to guess, using fourcc, magic number, whatever. – Jean-François Fabre Feb 06 '20 at 22:53
  • @khelwood, not true. If a forum allows image uploads and someone writes malware in Python, changes the file extension to .jpeg, and successfully uploads it, that's a problem. – Mark Feb 06 '20 at 22:54
  • @Mark That is irrelevant to my statement. I'm not saying the contents of the file are not important. I'm saying your question suggests a misunderstanding of what "file extension" means. – khelwood Feb 06 '20 at 22:55
  • Mark, what @khelwood is trying to say is that the extension is inherently part of the filename. What you should be asking is what the file type is. – user1558604 Feb 06 '20 at 22:57
  • If you are using Linux, there is a system command called "file" that will make a pretty good guess at this for you. You would, of course, have to fork a process and examine the stdout. If you are using Windows, the GNUWIN32 toolset has a copy of the "file" command in it. – Frank Merrow Feb 06 '20 at 23:27
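
The last comment's suggestion (run the "file" command in a child process and read its stdout) could look roughly like the sketch below. The function name `describe_with_file` is hypothetical, and the sketch assumes a `file` executable is on the PATH; it returns None when that is not the case.

```python
import shutil
import subprocess


def describe_with_file(path):
    """Ask the external `file` command to describe `path`.

    Hypothetical helper: returns the command's description as a string,
    or None if no `file` executable can be found on the PATH.
    """
    if shutil.which("file") is None:
        return None
    # --brief omits the filename from the output, leaving only the description.
    result = subprocess.run(
        ["file", "--brief", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

The exact wording of the description depends on the installed version of `file` and its magic database, so it is better suited to logging or manual inspection than to strict string comparison.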

1 Answer

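For anyone with the same problem, the following worked for me. I first had to install python-magic (imported as magic) from https://github.com/ahupp/python-magic; it is a wrapper around the libmagic library, which must also be present on the system.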

>>> import magic
>>> magic.from_file("/Users/mark/Desktop/test_copy.jpeg")
'HTML document, ASCII text, with very long lines'
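
If installing python-magic is not an option, the signature comparison described in the question can be sketched with the standard library alone. This is only a minimal sketch: the `SIGNATURES` table and the `sniff_file_type` function are hypothetical names, and the table covers just the PDF and JPEG signatures mentioned above, so any other format comes back as None.

```python
# Hypothetical lookup table: leading bytes ("magic numbers") -> file type.
SIGNATURES = {
    b"%PDF-": "pdf",          # 25 50 44 46 2D
    b"\xff\xd8\xff": "jpeg",  # shared prefix of FF D8 FF DB, FF D8 FF EE, ...
}


def sniff_file_type(path):
    """Guess a file's type from its first bytes, ignoring its extension.

    Returns "pdf", "jpeg", or None when no known signature matches.
    """
    with open(path, "rb") as f:
        header = f.read(8)  # long enough for every signature in the table
    for magic_bytes, file_type in SIGNATURES.items():
        if header.startswith(magic_bytes):
            return file_type
    return None
```

For a PDF that was merely renamed to test.jpeg, `sniff_file_type` should return "pdf", since only the file's contents are inspected. A matching signature is still just a hint about the format and not proof that the file is safe, so uploads should not be trusted on this check alone.
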
Mark