Create Binary files in UNIX

Question

This question was out there for a while and I thought I should offer some bonus points if I can get it to work.

What did I do…

Recently at work, I wrote a parser that converts a binary file into a readable format. By binary file I don't mean an ASCII file containing 10101010 characters; the data itself is stored as raw bytes. So if I run cat on the file, I get the following:

[jaypal~/Temp/GTP]$ cat T20111017153052.NEW 
==?sGTP?ղ?N????W????&Xx1?T?&Xx1?;
?d@#e?
      ?0H????????|?X?@@(?ղ??VtPOC01
cceE??k@9??W傇??R?K?i2??d@#e???&Xx1&Xx??!?
blackberrynet?/??!

??!

??#ripassword??W傅?W傆??0H??
                            #R??@Vtc@@(?ղ??n?POC01

So I used the hexdump utility to display the file as shown below and redirected the output to a file. Now I had an output file, a text file containing hex values.

[jaypal~/Temp/GTP]$ hexdump -C T20111017153052.NEW 
00000000  3d 3d 01 f8 73 47 54 50  02 f1 d5 b2 be 4e e4 d7  |==..sGTP.....N..|
00000010  00 01 01 00 01 80 00 cc  57 e5 82 00 00 00 00 00  |........W.......|
00000020  00 00 00 00 00 00 00 00  87 d3 f5 13 00 00 00 00  |................|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 01 00 10  |................|
00000040  01 01 0f 00 00 00 00 00  26 58 78 31 00 b3 54 c5  |........&Xx1..T.|
00000050  26 58 78 31 00 b4 3b 0a  00 00 ad 64 13 40 01 03  |&Xx1..;....d.@..|
00000060  23 16 65 f3 01 01 0b 91  30 19 48 99 f2 ff ff ff  |#.e.....0.H.....|
00000070  ff ff ff 02 00 7c 00 dc  01 58 00 a0 40 40 28 02  |.....|...X..@@(.|
00000080  f1 d5 b2 b8 ca 56 74 50  4f 43 30 31 00 00 00 00  |.....VtPOC01....|
00000090  00 04 0a 63 63 07 00 00  00 00 00 00 00 00 00 00  |...cc...........|
000000a0  00 00 00 65 45 00 00 b4  fb 6b 40 00 39 11 16 cd  |...eE....k@.9...|
000000b0  cc 57 e5 82 87 d3 f5 52  85 a1 08 4b 00 a0 69 02  |.W.....R...K..i.|
000000c0  32 10 00 90 00 00 00 00  ad 64 00 00 02 13 40 01  |2........d....@.|

After tons of awk, sed and cut, the script converted the hex values into readable text. To do so, I used offset positions that mark the start and end of each parameter to convert. After all the conversion, the resulting file looks like this:

[jaypal:~/Temp/GTP] cat textfile.txt 
Beginning of DB Package Identifier: ==
Total Package Length: 508
Offset to Data Record Count field: 115
Data Source: GTP
Timestamp: 2011-10-25
Matching Site Processor ID: 1
DB Package format version: 1
DB Package Resolution Type: 0
DB Package Resolution Value: 1
DB Package Resolution Cause Value: 128
Transport Protocol: 0
SGSN IP Address: 220.206.129.47
GGSN IP Address: 202.4.210.51

Why did I do it

I am a test engineer, and manually validating binary files was a major pain. I had to step through the offsets by hand, convert the values with a calculator, and validate them against Wireshark and the GUI.

Now the question part

I wish to do the reverse of what I did. This was my plan -

1. Have an easy-to-read Input text file containing Parameter : Value pairs.
2. The user simply puts values next to the parameters (e.g. Date would be a parameter, and the user can give the date they want the data file to have).
3. The script cuts out all relevant information (the user-provided values) from the Input text file and converts it into hex values.
4. Once the file has been converted into hex values, I wish to encode it back into binary.

First three steps are done

Problem

Once my script converts the Input text file into a text file of hex values, I get a file like the following (notice that I can run cat on it).

[visdba@hw-diam-test01 ParserDump]$ cat temp_file | sed 's/.\{32\}/&\n/g' | sed 's/../& /g'
3d 3d 01 fc 73 47 54 50 02 f1 d6 55 3c 9f 49 9c
00 01 01 00 01 80 00 dc ce 81 2f 00 00 00 00 00
00 00 00 00 00 00 00 00 ca 04 d2 33 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 10
01 01 0f 00 00 07 04 ea 00 00 ff ff 00 00 14 b7
00 00 ff ff 00 00 83 ec 00 00 83 62 54 14 59 00
60 38 34 f5 01 01 0b 58 62 70 11 60 f6 ff ff ff
ff ff ff 02 00 7c 00 d0 01 4c 00 b0 40 40 28 02
f1 d6 55 38 cb 2b 23 50 4f 43 30 31 00 00 00 00
00 04 0a 63 63 07 00 00 00 00 00 00 00 00 00 00

My intention is to encode this converted file into binary so that when I run cat on the file, I get a bunch of garbage values.

[jaypal~/Temp/GTP]$ cat temp.file 
==?sGTP?ղ?N????W????&Xx1?T?&Xx1?;
?d@#e?
      ?0H????????|?X?@@(?ղ??VtPOC01
cceE??k@9??W傇??R?K?i2??d@#e???&Xx1&Xx??!?
blackberrynet?/??!

??!

So the question is this. How do I encode it in this form?

Why I want to do this?

We don't have a lot of GTP (GPRS Tunnelling Protocol) messages on production. I thought if I reverse engineer this, I could effectively create a data generator and make my own data.

Sum things up

There may be sophisticated tools out there, but I don't want to spend too much time learning them. I started working on the *nix platform only about 2 months ago and am just getting my hands around its power tools like sed and awk.

What I do want is some help and guidance to make this happen.

Thanks again for reading! 200 points awaits for someone who can guide me in the right direction. :)

Sample Files

Here is a sample of Original Binary File

Here is a sample of Input Text File that would allow the User to punch in values

Here is a sample of File that my script creates after all the conversion from the Input Text File is complete.

How do I change the encoding of File 3 to File 1?

If you had shown the parser code, we could have gone and started to show a reversal. Now it would amount to reverse engineering from a single, noisy, sample. That's not going to work. Perhaps if you put a bonus of 500 pts on it :) — sehe, Nov 10 '11 at 22:06
If I had 500 points, I would have. :) Well it's not the conversion I am worried about. Even once the conversion is done the resulting file would still be a text file. The only difference would be that instead of ASCII or Decimal content, it would be Binary. My main concern is to make a binary file. File when I do a cat would give me garbage characters. — jaypal singh, Nov 11 '11 at 01:27
The 'File' that your script produced is not a direct hex dump of the 'Binary File'. The fourth character in 'Binary File' is 0xF8; the fourth character in 'File' is coded as 'fc'. Is this a problem with the data downloaded, or ... what? However, if you remove the `cut` from my command, the remaining `awk` script will take the contents of 'File' and produce binary output containing a faithful conversion of the pairs of hex digits into corresponding bytes in the range 0x00..0xFF. — Jonathan Leffler, Nov 27 '11 at 03:17
No these were just samples. But the problem I am facing is that the file my script is making is a text file even though the content inside it may not be text. The loader process rejects these files since it only accepts binaries. — jaypal singh, Nov 27 '11 at 03:26
The binary output from your script if I redirect into a file, the encoding will still make the file a text file containing binary data. I want to encode the file into binary. — jaypal singh, Nov 27 '11 at 03:28
You'll need to define what you mean by 'binary file' and 'loader process'. You may also need to define platform and other information. — Jonathan Leffler, Nov 27 '11 at 06:57
Sure Jon. I have pointed to what type of file I need to make (binary file) and the type of file I have. — jaypal singh, Nov 27 '11 at 07:09
Thanks. I'm not sure I understand the platform yet. I wonder how you are transferring files to the 'loader process'? Is it via FTP? If so, are you running FTP in binary mode or ASCII mode? It might matter. Given an 8-bit clean file system (such as found on Unix systems), the conversion from original (binary) to hex dump format, and then (using the code I provided) to convert the hex dump back into a (binary) file is also clean. If you are working on a system where there is a distinction between text and binary files, then it may not be possible to round-trip the two conversions accurately. — Jonathan Leffler, Nov 27 '11 at 07:16
Thanks Jon, I will check with my Development team and let you know. Currently what we are doing is, using wireshark tool to capture the raw pcaps. These raw pcaps are relayed via a tool called tcp-replay to a Probe Proxy. This piece of code adds additional information to the raw pcap and sends to the Databroker tool via tcp stream. This databroker creates the binary .NEW files. Both raw pcaps and .NEW files if I do a cat command on them gives me garbage values. I guess this is my definition of Binary files :). These binary files are then lifted by Tandem Loaders to loaded onto Tandem DB. — jaypal singh, Nov 27 '11 at 08:19
Thanks Dejan, I have formatted the question again so that it is a little more clearer. :) — jaypal singh, Nov 28 '11 at 20:37
Read my answer. I answered the question, and also wrote a simple start of a BASH-based packet builder for you. Take that code, end improve it. — DejanLekic, Nov 28 '11 at 21:51

score 16 · Accepted Answer · 2011-11-29T08:31:47.940

You can use xxd to convert to and from binary files / hexdumps quite simply.

data to hex

echo  Hello | xxd -p 
48656c6c6f0a

hex to data

echo 48656c6c6f0a | xxd -r -p
Hello

or

echo 48 65 6c 6c 6f 0a | xxd -r -p
Hello

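The -p option selects the plain (PostScript-style continuous) hexdump format. Combined with -r, it makes xxd read hex digits without line numbers or a fixed column layout, so spaces and line breaks in the input are ignored.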

This is the output from xxd -r -p text where text is the data you give above

==▒sGTP▒▒U<▒I▒▒▒΁/▒▒3▒▒▒▒▒▒▒▒▒bTY`84▒
                                     Xbp`▒▒▒▒▒▒▒|▒L▒@@(▒▒U8▒+#POC01
:▒ިv▒b▒▒▒▒TY`84Ud▒▒▒▒>▒▒▒▒▒▒▒!▒
blackberrynet▒/▒▒!
M
▒▒!
N
▒▒#Oripassword▒▒΁/▒▒΁/▒▒Xbp`▒@@(▒▒U8▒IvPOC01
:qU▒b▒▒▒▒▒▒TY`84U▒▒▒*:▒▒!
▒k▒▒▒#O Welcmme!
▒!
M
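If xxd is not installed, the same reverse conversion can be sketched in a few lines of Python. This is only an illustrative equivalent of `xxd -r -p`, not part of xxd itself; the function names `hex_to_bytes` and `write_binary` are made up for this example.

```python
def hex_to_bytes(hex_text: str) -> bytes:
    """Convert hex digit pairs (spaces and newlines allowed) into raw bytes."""
    # Strip all whitespace first so the input may be wrapped or space-separated.
    return bytes.fromhex("".join(hex_text.split()))


def write_binary(path: str, hex_text: str) -> None:
    """Write the decoded bytes to `path`; mode "wb" writes them unchanged."""
    with open(path, "wb") as out:
        out.write(hex_to_bytes(hex_text))


print(hex_to_bytes("48 65 6c 6c 6f 0a"))  # b'Hello\n'
```

On a Unix file system the result is an ordinary file whose bytes are exactly the decoded values; there is no separate "binary encoding" step to apply afterwards.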

Jonathan Leffler · Answer 2 · 2011-11-27T02:46:50.437

Using cut and awk, you can do it fairly simply using a gawk (GNU Awk) extension function, strtonum():

cut -c11-60 inputfile |
awk '{ for (i = 1; i <= NF; i++)
       {
           c = strtonum("0x" $i)
           printf("%c", c);
       }
     }' > outputfile

Or, if you are using a non-GNU version of 'new awk', then you can use:

cut -c11-60 inputfile |
awk '{  for (i = 1; i <= NF; i++)
        {
            s = toupper($i)
            c0 = index("0123456789ABCDEF", substr(s, 1, 1)) - 1
            c1 = index("0123456789ABCDEF", substr(s, 2, 1)) - 1
            printf("%c", c0*16 + c1);
        }
     }' > outputfile

If you want to use other tools (Perl and Python spring to mind; Ruby would be another possibility), you can do it easily enough.
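As a rough illustration of the Python route, the sketch below reverses `hexdump -C` output by taking the same columns as `cut -c11-60` and converting each hex pair to a byte. The function name `undump` and the sample lines are assumptions for this example. One caveat: `hexdump -C` collapses repeated lines into a `*` line unless it is run with `-v`, and this sketch would silently drop those bytes.

```python
def undump(lines):
    """Rebuild raw bytes from `hexdump -C` style lines."""
    out = bytearray()
    for line in lines:
        # Columns 11-60 (1-based) hold the hex pairs, like `cut -c11-60`.
        hex_columns = line[10:60]
        out.extend(int(pair, 16) for pair in hex_columns.split())
    return bytes(out)


sample = [
    "00000000  3d 3d 01 fc 73 47 54 50  02 f1 d6 55 3c 9f 49 9c  |==..sGTP...U<.I.|",
    "00000010",
]
data = undump(sample)
print(len(data))  # 16
```

To produce a file, the result can be written with `open("outputfile", "wb").write(data)`.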

odx is a program similar to the hexdump program. The script above was modified to read 'hexdump.out' as the input file, and the output piped into odx instead of a file, and gives the following output:

$ cat hexdump.out
00000000  3d 3d 01 fc 73 47 54 50  02 f1 d6 55 3c 9f 49 9c  |==..sGTP...U<.I.|
00000010  00 01 01 00 01 80 00 dc  ce 81 2f 00 00 00 00 00  |........../.....|
00000020  00 00 00 00 00 00 00 00  ca 04 d2 33 00 00 00 00  |...........3....|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 10  |................|
00000040  01 01 0f 00 00 07 04 ea  00 00 ff ff 00 00 14 b7  |................|
00000050  00 00 ff ff 00 00 83 ec  00 00 83 62 54 14 59 00  |...........bT.Y.|
00000060  60 38 34 f5 01 01 0b 58  62 70 11 60 f6 ff ff ff  |`84....Xbp.`....|
00000070  ff ff ff 02 00 7c 00 d0  01 4c 00 b0 40 40 28 02  |.....|...L..@@(.|
$ sh -x revdump.sh | odx
+ cut -c11-60 hexdump.out
+ awk '{  for (i = 1; i <= NF; i++)
        {
            #c = strtonum("0x" $i)
            #printf("%c", c);
            s = toupper($i)
            c0 = index("0123456789ABCDEF", substr(s, 1, 1)) - 1
            c1 = index("0123456789ABCDEF", substr(s, 2, 1)) - 1
            printf("%c", c0*16 + c1);
        }
     }'
0x0000: 3D 3D 01 FC 73 47 54 50 02 F1 D6 55 3C 9F 49 9C   ==..sGTP...U<.I.
0x0010: 00 01 01 00 01 80 00 DC CE 81 2F 00 00 00 00 00   ........../.....
0x0020: 00 00 00 00 00 00 00 00 CA 04 D2 33 00 00 00 00   ...........3....
0x0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 10   ................
0x0040: 01 01 0F 00 00 07 04 EA 00 00 FF FF 00 00 14 B7   ................
0x0050: 00 00 FF FF 00 00 83 EC 00 00 83 62 54 14 59 00   ...........bT.Y.
0x0060: 60 38 34 F5 01 01 0B 58 62 70 11 60 F6 FF FF FF   `84....Xbp.`....
0x0070: FF FF FF 02 00 7C 00 D0 01 4C 00 B0 40 40 28 02   .....|...L..@@(.
0x0080:
$

Or, using hexdump -C in place of odx:

$ sh -x revdump.sh | hexdump -C
+ cut -c11-60 hexdump.out
+ awk '{  for (i = 1; i <= NF; i++)
        {
            #c = strtonum("0x" $i)
            #printf("%c", c);
            s = toupper($i)
            c0 = index("0123456789ABCDEF", substr(s, 1, 1)) - 1
            c1 = index("0123456789ABCDEF", substr(s, 2, 1)) - 1
            printf("%c", c0*16 + c1);
        }
     }'
00000000  3d 3d 01 fc 73 47 54 50  02 f1 d6 55 3c 9f 49 9c  |==..sGTP...U<.I.|
00000010  00 01 01 00 01 80 00 dc  ce 81 2f 00 00 00 00 00  |........../.....|
00000020  00 00 00 00 00 00 00 00  ca 04 d2 33 00 00 00 00  |...........3....|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 10  |................|
00000040  01 01 0f 00 00 07 04 ea  00 00 ff ff 00 00 14 b7  |................|
00000050  00 00 ff ff 00 00 83 ec  00 00 83 62 54 14 59 00  |...........bT.Y.|
00000060  60 38 34 f5 01 01 0b 58  62 70 11 60 f6 ff ff ff  |`84....Xbp.`....|
00000070  ff ff ff 02 00 7c 00 d0  01 4c 00 b0 40 40 28 02  |.....|...L..@@(.|
00000080
$

Thanks Jon, Well I have already done the conversion part of the script. I just want to encode that converted file in to Binary file. Even after the conversion, I can view the file simply by doing `cat`. — jaypal singh, Nov 27 '11 at 02:43
If you write the output of the `awk` script to a file, that file will be a binary file with the same content as the binary file from which the `hexdump -C` output was generated. So, given the `hexdump` output, you can regenerate the original binary data. However, the binary data is hard to show - so the best I can do is show that binary file dumpers produce the equivalent output to what `hexdump -C` produced in the first place. — Jonathan Leffler, Nov 27 '11 at 02:49
I have uploaded the Binary File, My Input File and the converted file. I need the converted file encoded in binary. — jaypal singh, Nov 27 '11 at 03:03

daouzli · Answer 3 · 2017-01-12T12:38:06.800

There's a tool, binmake, that lets you describe binary data in a text format and generate a binary file from it (or write the result to stdout). It lets you change the endianness and number formats, and it accepts comments.

First get and compile binmake (the binary program will be in bin/):

$ git clone https://github.com/dadadel/binmake
$ cd binmake
$ make

Create your text file file.txt:

# an example of a file describing binary data to generate
# set endianness to big-endian
big-endian

# default number is hexadecimal
00112233

# one can give an explicit number type: %b means binary number
%b0100110111100000

# change endianness to little-endian
little-endian

# if no explicit type is given, the default is used
44556677

# single bytes are not affected by endianness
88 99 aa bb

# change default to decimal
decimal

# following number is now decimal
0123

# strings are delimited by " or '
"this is some raw string"

# explicit hexa number starts with %x
%xff

Generate your binary file file.bin:

$ ./binmake file.txt file.bin
$ hexdump file.bin -C
00000000  00 11 22 33 4d e0 77 66  55 44 88 99 aa bb 7b 74  |.."3M.wfUD....{t|
00000010  68 69 73 20 69 73 20 73  6f 6d 65 20 72 61 77 20  |his is some raw |
00000020  73 74 72 69 6e 67 ff                              |string.|
00000027

You can also pipe it using stdin and stdout:

$ echo '32 decimal 32 %x61 61' | ./binmake | hexdump -C
00000000  32 20 61 3d                                       |2 a=|
00000004

DejanLekic · Answer 4 · 2011-11-28T23:04:50.193

1

To change encoding from File3 to File1, you use a script like this:

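#!/bin/bash

# file name: tobin.sh

fileName="tobin.txt"   # todo: pass it as parameter
                       #       or prepare it to be used via the pipe...

# Read whitespace-separated hex pairs and emit each one as a raw byte.
# The "|| [ -n ... ]" part keeps a last line that has no trailing newline.
while read -r line || [ -n "$line" ]; do
  for hexValue in $line; do
    printf "\x$hexValue"
  done
done < "$fileName"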

Or, if you just want to pipe it, and use like the xxd example in this thread:

#!/bin/bash

# file name: tobin.sh
# usage: cat file3.txt | ./tobin.sh > file1.bin

# Read whitespace-separated hex pairs from stdin and emit each one as a raw byte.
while read -r line || [ -n "$line" ]; do
  for hexValue in $line; do
    printf "\x$hexValue"
  done
done

If you really want to use BASH for this, then I suggest you start using an array to build your packet nicely. Here is some starting code:

#!/bin/bash

# Arrays are a bash feature, so this must run under bash rather than plain sh.
# Each buffer[] element holds one byte as a number (0x00..0xff).
# We assume the target format is LSB (little-endian).

###
# hexDump() prints the buffer[] array as hex digit pairs, for debugging.
#
hexDump() {
  local idx
  for ((idx = 0; idx < ${#buffer[@]}; idx++)); do
    printf "%02X " "${buffer[idx]}"
  done
  printf "\n"
} # hexDump() function

###
# dump() dumps the current content of the buffer[] array to the STDOUT.
#
dump() {
  local idx
  for ((idx = 0; idx < ${#buffer[@]}; idx++)); do
    # Turn the byte into a three-digit octal escape; printf emits it as one raw byte.
    printf "\\$(printf '%03o' "${buffer[idx]}")"
  done
} # dump() function

# Beginning of DB Package Identifier: ==
buffer[0]=0x3d # =
buffer[1]=0x3d # =

# Total Package Length: 2
# We start with 2, and later on we update it once we know the exact size...
# Assuming a 32-bit LSB field, this is how we encode the number 2 (our current packet size)
buffer[2]=0x02
buffer[3]=0x00
buffer[4]=0x00
buffer[5]=0x00

# Offset to Data Record Count field: 115
# I assume this is also a 32-bit field of unsigned int type
ptr=5
buffer[++ptr]=0x73  # 115
buffer[++ptr]=0x00
buffer[++ptr]=0x00
buffer[++ptr]=0x00

#hexDump
dump

Output:

$ ./tobin2.sh | hexdump -C
00000000  3d 3d 02 00 00 00 73 00  00 00                    |==....s...|
0000000a

Sure, this is not the solution to the original post, but the solution will use something like this to generate binary output. The biggest problem is that we still do not know the types of the fields in the packet. We also do not know the architecture (is it big-endian or little-endian, 32-bit or 64-bit?). You must give us the specification. For instance, what type is the package length field? We cannot tell that from the TXT file!

In order to help you do what you have to do, you must find us the specification for the sizes of those fields.

Note that it is a good start, though. You need to implement convenience functions to, for example, automatically fill the buffer[] with values from a string of hex values, so that you can write something like: write $offset "ff c0 d3 ba be".
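For comparison, here is how the same buffer-building idea might look in Python with the standard struct module. The field layout is an assumption, not the real specification: the first bytes of the hex dump in the question (3d 3d 01 fc 73 47 54 50) happen to line up with "==", a 2-byte big-endian length of 508, a 1-byte offset of 115 and the text "GTP", so the sketch uses that layout. The function names `build_header` and `patch_length` are hypothetical, and so is the assumption that the length field counts the whole packet.

```python
import struct


def build_header(total_length: int, record_count_offset: int, source: bytes) -> bytes:
    """Pack the first four fields of the (assumed) header layout."""
    # ">" = big-endian, no padding; 2s = two raw bytes; H = unsigned 16-bit;
    # B = unsigned 8-bit; 3s = three raw bytes.
    return struct.pack(">2sHB3s", b"==", total_length, record_count_offset, source)


def patch_length(packet: bytearray, field_offset: int = 2) -> None:
    """Overwrite the 16-bit length field once the final packet size is known."""
    struct.pack_into(">H", packet, field_offset, len(packet))


header = build_header(508, 115, b"GTP")
print(header.hex())  # 3d3d01fc73475450
```

Once the real field sizes and byte order are known, only the format strings need to change; the rest of the packet can be appended to a bytearray and the length patched at the end.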

edited Nov 28 '11 at 23:04

answered Nov 28 '11 at 20:07


`./tobin.sh | hexdump -C` produces the same hexdump as above. Well, at least the first chunk is the same. I guess it depends what kind of binary input you used for generating that txt file with hex values... – DejanLekic Nov 28 '11 at 20:27
This looks really good Dejan. I will try it out. I have the offset positions in the design document. Will have to talk to my manager if it would be ok to publish the offset positioning for all, since it's confidential and proprietary information. But I can share some of them. Thanks again! – jaypal singh Nov 28 '11 at 21:54
No need to disclose such information here. You get the point from the code above. Just keep filling the buffer[] array with values, and keep thinking about the architecture. I would start with writing `writeUInt32()` function so you can use it like `writeUInt32 $offset 0xbabadeda` – DejanLekic Nov 28 '11 at 21:58
The solution provided by @lain works without passing in the values. – jaypal singh Nov 28 '11 at 22:17
I am not talking here about conversion from File3 to File1 (if you carefully read that code, it can behave exactly like xxd - just replace the fileName line with `fileName=$1` and call it via: `tobin.sh file3.txt`). The biggest problem here is the format and how you generate a correct binary output. Because, for an example, the number of parameters may vary, and the length changes, so you have to modify the buffer and set the proper length, otherwise the application which reads that binary input will fail to "understand" it. – DejanLekic Nov 28 '11 at 22:46
Hmm, for a quick test, I did a hexdump on the original .NEW file > temp.file and then removed the Offset column and ASCII portion. Did the xxd on it and the diff of two files did not show any difference. But I agree, you make an excellent point here. I will try to load the files created by xxd through the loader. If they are rejected then will start working on extending your script. Thanks again for bringing up such good points. If the xxd fails, I will open a separate question with 500 points bounty just for you. :) – jaypal singh Nov 28 '11 at 22:51
I do not do this for bounty - it is just an interesting topic. I would do a D or C/C++ application for this myself, but as I said, if you really want to do it in BASH, it is possible, but not trivial. :) It is nearly impossible to do this if you do not know the specification of that format... – DejanLekic Nov 28 '11 at 22:54
Please don't get me wrong. I really do appreciate you spending time on this. Hopefully, I will get to work on *nix backend going forward cause I really hate HP Tandem Guardian OS and I have been spending too much time learning sed and awk. It's funny, I can now write some pretty cool awk scripts, but I dont know how to write basic shell script. :) – jaypal singh Nov 28 '11 at 22:57
I've added the source of tobin.sh, which can be used like xxd (i.e. you can pipe the input). It is useful in case you do not have the xxd utility. – DejanLekic Nov 28 '11 at 23:06

score 0 · Answer 5 · answered Nov 10 '11 at 21:03


awk is the wrong tool for the job here, but there are a thousand ways to do it. The easiest way is often a small C program, or a program in any other language that explicitly makes a distinction between a character and a string of decimal digits.

However, to do it in awk, use the "%c" printf format.


thiton


Thanks Thiton, conversion is not the actual problem. Even after the conversion is done, the file will still be an ascii file, i.e I can do a cat command and it will display all characters. I need to make a binary file instead, which when I do hexdump filename, I get the same .NEW file shown above. – jaypal singh Nov 10 '11 at 21:19
I have to agree with @thiton using text processing tools to write binary files is arduous and error prone. If you want to use a scripting language then perl and the `pack/unpack` functions will be your friend. – potong Nov 10 '11 at 21:19
agree with basic concept that unix tool box is not the way to create binary files, but I think I've seen people use `dd` here on S.O. to do similar stuff? Good luck to all. – shellter Nov 11 '11 at 04:27