I'm using Ruby 2.4 and Rails 5. I have file content in a variable named "content". The content could contain data from things like a PDF file, a Word file, or an HTML file. Is there any way to tell if the variable contains binary data? Ultimately, I would like to know if this is a PDF, a Microsoft Office file, or some type of OpenOffice file. This answer -- Rails: possible to check if a string is binary? -- suggests that I can check the encoding of the variable

content.encoding

and it would produce

ASCII-8BIT

in the case of binary data. However, I've noticed there are cases where HTML content stored in the variable also returns "ASCII-8BIT" as the content.encoding, so using "content.encoding" is not a foolproof way to tell whether I have binary data. Does such a way exist and, if so, what is it?

Dave
  • Given your requirements, it seems like you're going to have to do some analysis of the content. I'd pull the top n bytes and check them against the standard ASCII codes. If many of the characters you encounter aren't ASCII, it's likely that your content is binary. Seems like a chi-squared test may be a good fit. Why can't you get access to the actual file object? – Brennan May 03 '17 at 19:09
  • I'm accessing the content from a database in which there is no additional information about the file. Sometimes there is a file name, but extensions are unreliable for determining file/content type. – Dave May 03 '17 at 20:24
  • Wait, the content of the file is in the DB? – Brennan May 03 '17 at 20:30
  • If your real question is not about binary data per se but about determining the file type of the data, I'd recommend having a look at the [ruby-filemagic gem](http://stackoverflow.com/a/901660/1544012), which will give you this information much more reliably. – Matouš Borák May 03 '17 at 20:40
  • @BoraMa, are you saying I need to write the file content to a file, then feed the file path into FileMagic, and it will tell me what type of file I have? – Dave May 03 '17 at 20:45
  • 1
    @Dave According to the gem's documentation at https://github.com/blackwinter/ruby-filemagic it can work with a buffer, so you wouldn't need to write anything to a file. Just read the first N bytes into memory and pass it to the gem. – Brian May 03 '17 at 21:34
  • But does this gem work with Rails 5? I'm getting a "Gem::Ext::BuildError: ERROR: Failed to build gem native extension" when I try and install it. – Dave May 03 '17 at 21:38
  • I adopted my recommendation into an answer. – Matouš Borák May 04 '17 at 04:27
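
The byte-sampling idea from the first comment can be sketched in plain Ruby. This is only a rough heuristic, not a reliable detector: the method name `probably_binary?`, the sample size, and the threshold are all hypothetical choices made for illustration. It treats a NUL byte as a strong sign of binary data and otherwise looks at the share of control characters, so UTF-8 or Latin-1 text with accented characters is not flagged.

```ruby
# Heuristic sketch (assumed values, not from the original thread):
# sample the first bytes and decide whether they look like text.
SAMPLE_SIZE = 8192   # hypothetical: how many leading bytes to inspect
THRESHOLD   = 0.30   # hypothetical: share of control bytes that means "binary"

def probably_binary?(content, sample_size: SAMPLE_SIZE, threshold: THRESHOLD)
  # Work on raw bytes so the string's declared encoding does not matter.
  sample = content.byteslice(0, sample_size).to_s.b
  return false if sample.empty?

  # A NUL byte almost never appears in text files.
  return true if sample.include?("\x00".b)

  # Count control bytes other than tab, line feed and carriage return.
  suspicious = sample.each_byte.count do |byte|
    (byte < 0x20 && ![0x09, 0x0A, 0x0D].include?(byte)) || byte == 0x7F
  end
  suspicious.fdiv(sample.bytesize) > threshold
end
```

Because it ignores `content.encoding` entirely, HTML that happens to be tagged ASCII-8BIT is still classified by what its bytes look like. It says nothing about which binary format the data is.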

2 Answers

If your real question is not about binary data per se but about determining the file type of the data, I'd recommend having a look at the ruby-filemagic gem, which will give you this information much more reliably. The gem is a simple wrapper around the libmagic library, which is standard on Unix-like systems. The library works by scanning the content of a file and matching it against a set of known "magic" patterns for various file types.

Sample usage for a string buffer (e.g. data read from the database):

require "ruby-filemagic"

content = File.read("/.../sample.pdf") # just an example to get some data

fm = FileMagic.new
fm.buffer(content)    
#=> "PDF document, version 1.4"

For the gem to work (and compile) you need the file utility as well as the magic library with headers installed on your system. Quoting from the readme:

The file(1) library and headers are required:

Debian/Ubuntu:: +libmagic-dev+
Fedora/SuSE:: +file-devel+
Gentoo:: +sys-libs/libmagic+
OS X:: brew install libmagic

Tested to work well under Rails 5.

Matouš Borák
  • Hmmm, I'm still getting a build error when I try and install this gem -- "checking for -lgnurx... no, *** ERROR: missing required library to compile this module". I will have to research that and then come back and try your suggestion. – Dave May 04 '17 at 14:12
  • What system are you trying this on? If you get stuck, can you post the full log with the error messages? – Matouš Borák May 04 '17 at 14:28
  • I hadn't run "brew install libmagic" per your suggestion. Running that does allow everything to install. One question I couldn't figure out from the docs -- does "buffer" always print out file types in a consistent way? That is, do Excel docs always output "Microsoft Excel" and PDF docs always print out the word "PDF"? – Dave May 04 '17 at 14:59
  • Good! Regarding your question, there is of course no absolute certainty, but I'd expect the output to be very consistent. The `file` utility and the associated `magic` library have been around for many, many years, and there is no reason for the authors to change their behavior. Take a look at the [sources](https://github.com/threatstack/libmagic/tree/master/magic/Magdir) for all the format variants that the library currently recognizes. – Matouš Borák May 04 '17 at 19:29
  • Heya, I started a bounty on this one only because I can see no consistency in the way file types are printed out by this gem. I'm getting too much variation to feel comfortable using the solution. – Dave May 16 '17 at 16:37
  • Out of curiosity, can you give an example of the variation? – Matouš Borák May 16 '17 at 19:07
  • Look at the source of the gem, then; if you do, you'll see it's basically just a wrapper around what your system is identifying. So your problem is with the system utilities that identify the [Magic Byte](https://blog.netspi.com/magic-bytes-identifying-common-file-formats-at-a-glance/) of the file and not the gem itself. Basically, you've stumbled on one of the dark magic areas of computing, i.e. it's hard to determine file type accurately. – engineerDave May 17 '17 at 16:24
  • @engineerDave yeah, that's why I recommended looking at the library sources above. There are hundreds of formats and their variants recognized by the library. If you need to support only a few of them, then it would make sense to use such a library; if you need something more general (like "all binary formats"), then you'd indeed need to use something else. – Matouš Borák May 17 '17 at 18:54
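
If only the handful of formats from the question matter, the "magic bytes" idea discussed in these comments can be sketched without any gem by comparing the leading bytes against a few well-known signatures. The names `SIGNATURES` and `sniff_container` are hypothetical. Note the limits: a ZIP signature only says the data is a ZIP archive (which is what .docx/.xlsx/.pptx and OpenDocument files are), and the OLE2 signature only says it is a legacy compound document (.doc/.xls/.ppt); telling Word from Excel would need a look inside the container.

```ruby
# Leading-byte signatures for the container formats from the question.
# Sketch only: a real detector (e.g. libmagic) checks far more than this.
SIGNATURES = {
  pdf:  "%PDF-".b,
  zip:  "PK\x03\x04".b,                        # OOXML and OpenDocument are ZIP archives
  ole2: "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b   # legacy Office compound documents
}.freeze

def sniff_container(content)
  # Compare raw bytes, regardless of the string's declared encoding.
  head = content.byteslice(0, 8).to_s.b
  match = SIGNATURES.find { |_name, magic| head.start_with?(magic) }
  match ? match.first : :unknown
end
```

The symbols returned here are stable by construction, which sidesteps the worry about varying description strings, at the cost of recognizing only what is listed in the table.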

If you're on a Unix machine, you can use the file command:

file titi.pdf

You could then do something like:

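require 'open3'

cmd = 'file -'   # "-" tells file to read the data from standard input
# popen3 yields stdin, stdout, stderr and the wait thread, in that order
Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
  stdin.binmode            # content may be binary
  stdin.write(content)
  stdin.close              # signal EOF so file can finish
  puts "file type is: " + stdout.read
end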
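
A shorter variant of the same idea, as a sketch: `Open3.capture2` handles the pipe plumbing, and asking `file` for a MIME type instead of a free-form description may give output that is easier to compare against. This assumes the installed `file` utility supports the `--brief` and `--mime-type` options (current versions do, but check your system's man page); the method name `mime_type_via_file` is hypothetical.

```ruby
require "open3"

# Returns a MIME type string such as "application/pdf", or nil when the
# `file` utility is missing or exits with an error.
def mime_type_via_file(content)
  out, status = Open3.capture2(
    "file", "--brief", "--mime-type", "-",
    stdin_data: content, binmode: true
  )
  status.success? ? out.strip : nil
rescue Errno::ENOENT
  nil # `file` is not installed on this machine
end
```

Passing the command as separate arguments avoids going through a shell, so nothing in the content or options is subject to shell interpretation.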
Xavier Nicollet