0

I have a file that appears to contain plaintext headers, and I would like to extract them as plain text.

This is what I see when I open the file in hexedit:

3a40 - 31 65 33 38 00 00 00 00 00 00 00 00 00 00 00 00 - 1e38............
3a50 - 00 00 00 00 00 00 00 00 00 00 0a 00 74 00 65 00 - ............t.e.
3a60 - 78 00 74 00 2f 00 61 00 73 00 63 00 69 00 69 00 - x.t./.a.s.c.i.i.
3a70 - 00 00 18 00 61 00 66 00 66 00 79 00 6d 00 65 00 - ....a.f.f.y.m.e.
3a80 - 74 00 72 00 69 00 78 00 2d 00 61 00 72 00 72 00 - t.r.i.x.-.a.r.r.
3a90 - 61 00 79 00 2d 00 62 00 61 00 72 00 63 00 6f 00 - a.y.-.b.a.r.c.o.
3aa0 - 64 00 65 00 00 00 64 00 40 00 35 00 32 00 30 00 - d.e...d.@.5.2.0.
3ab0 - 38 00 32 00 36 00 30 00 30 00 39 00 31 00 30 00 - 8.2.6.0.0.9.1.0.
3ac0 - 37 00 30 00 36 00 31 00 31 00 31 00 38 00 31 00 - 7.0.6.1.1.1.8.1.
3ad0 - 31 00 34 00 31 00 32 00 31 00 33 00 34 00 35 00 - 1.4.1.2.1.3.4.5.
3ae0 - 35 00 30 00 39 00 38 00 39 00 00 00 00 00 00 00 - 5.0.9.8.9.......
3af0 - 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 - ................
3b00 - 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0a 00 - ................

and this is the output I'd like to get:

text/ascii  affymetrix-array-barcode d@52082600910706111811412134550989
blunders

3 Answers

1

Try with the iconv command. Something like this should work:

tail -c +6 input.txt | iconv -f UTF16 -t ASCII >output.txt

Then split on the null bytes.
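As a rough illustration of the same idea outside the shell, the sketch below decodes a byte range with an explicit byte order and then splits on NULs. It is written in Python rather than shell, and it assumes (from the dump in the question, not from any format specification) that the text is 16-bit big-endian and starts at an odd offset. `SAMPLE`, `START`, and `decode_from` are names made up for this example.

```python
def be(text):
    """Encode text as 16-bit big-endian, the layout the dump appears to use."""
    return text.encode("utf-16-be")

# Rebuild the 208 bytes shown in the question's dump (file offsets 0x3a40-0x3b0f).
SAMPLE = (
    b"1e38" + bytes(22) + b"\x0a" + be("text/ascii")
    + b"\x00\x00\x00\x18" + be("affymetrix-array-barcode")
    + b"\x00\x00\x00\x64" + be("@52082600910706111811412134550989")
    + bytes(34) + b"\x00\x00\x00\x0a\x00"
)

# Assumed position of the first text byte inside SAMPLE (0x3a5b in the real file).
START = 27

def decode_from(data, start, encoding="utf-16-be"):
    """Decode data[start:] as 16-bit text and return the printable fields."""
    chunk = data[start:]
    chunk = chunk[: len(chunk) - len(chunk) % 2]  # 16-bit units need an even length
    text = chunk.decode(encoding, errors="replace")
    fields = []
    for piece in text.split("\x00"):
        # Drop control characters such as the stray 0x18 and 0x0a bytes.
        cleaned = "".join(ch for ch in piece if ch.isprintable())
        if cleaned:
            fields.append(cleaned)
    return fields

fields = decode_from(SAMPLE, START)
print(" ".join(fields))
```

On the sample bytes this prints `text/ascii affymetrix-array-barcode d@52082600910706111811412134550989`. Both the starting offset and the byte order have to be right for the decode to produce readable text, so a conversion that starts at the beginning of a binary file is likely to fail early.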

sapht
Francisco R
  • @PacoRG: Thanks, just ran iconv and it returned this error: "iconv: illegal input sequence at position 0"; If I change UTF16 to UTF8, the position of the error changes to 428; the version of iconv is "iconv (GNU libc) 2.5" -- any suggestions? Thanks! – blunders May 10 '11 at 14:46
  • I think you need to manually delete the first 4 bytes of your file. – Francisco R May 10 '11 at 15:00
  • @PacoRG: Not sure how to manually delete 4 bytes; besides, the solution can't be manual, since this is part of a script. – blunders May 10 '11 at 15:08
  • `cut -c4- input.txt | iconv -f UTF16 -t ASCII >output.txt` – Francisco R May 10 '11 at 15:20
  • +1 @PacoRG: Cool, so... ran that and got "iconv: illegal input sequence at position 4" -- any suggestions for debugging this? – blunders May 10 '11 at 15:32
  • Oops, I didn't notice your hexedit capture starts at 3a40 (14912). Try this: `cut -c14928- input.txt | iconv -f UTF16 -t ASCII >output.txt`. BTW, from the link you added, I see that the CEL file format is a complex one. iconv can only help you extract some strings in a quick and dirty way, nothing more. – Francisco R May 10 '11 at 15:53
  • +1 @PacoRG: Thanks! Hmm, that returns the same error. Just to be clear, the input I provided is a very small sample of the file; I'm pointing this out because it's not clear to me where "c4" or "c14928" comes from. Also, I dumped the file to a text file with hexdump and was unable to find the pattern that hexedit showed, the one I provided in the question. I guess I just figured the ASCII content was stored as ASCII and that I could use a regex to target the hex supplied, extract it, convert it to ASCII, and clean it up if needed. – blunders May 10 '11 at 16:05
1

Granted, I'm no wiz, but this does the job if all your files look very similar to the one you just posted:

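use strict;
use warnings;

# Read the file in binary mode, starting at byte offset 28.
open my $fh, '<:raw', 'file.dat' or die "Cannot open file.dat: $!";
seek $fh, 28, 0;
my $buf = do { local $/; <$fh> };
close $fh;

# Fields are separated by pairs of NUL bytes; keep the first three.
my @s = split /\0\0/, $buf, 4;
# Strip the remaining NULs and control bytes so the output is plain ASCII.
tr/\x20-\x7e//cd for @s[0 .. 2];
print "$s[0] $s[1] $s[2]\n";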
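If the position varies between files, one alternative is to search for the field name and read what follows it. The Python sketch below does that under an assumption inferred only from the dump in the question: each string is preceded by a 4-byte big-endian length and stored as 16-bit big-endian text. `SAMPLE`, `MARKER`, and `find_barcode` are hypothetical names, and the layout should be checked against the CEL format documentation before relying on it.

```python
import struct

def be(text):
    """Encode text as 16-bit big-endian, the layout the dump appears to use."""
    return text.encode("utf-16-be")

# Rebuild the 208 bytes shown in the question's dump (file offsets 0x3a40-0x3b0f).
SAMPLE = (
    b"1e38" + bytes(22) + b"\x0a" + be("text/ascii")
    + b"\x00\x00\x00\x18" + be("affymetrix-array-barcode")
    + b"\x00\x00\x00\x64" + be("@52082600910706111811412134550989")
    + bytes(34) + b"\x00\x00\x00\x0a\x00"
)

MARKER = be("affymetrix-array-barcode")

def find_barcode(data):
    """Return the text stored after the marker, or None if the marker is absent."""
    pos = data.find(MARKER)
    if pos < 0:
        return None
    pos += len(MARKER)
    if pos + 4 > len(data):
        return None
    # Assumed layout: a 4-byte big-endian byte count, then that many bytes of value.
    (length,) = struct.unpack_from(">I", data, pos)
    raw = data[pos + 4 : pos + 4 + length]
    raw = raw[: len(raw) - len(raw) % 2]  # keep whole 16-bit units only
    return raw.decode("utf-16-be", errors="replace").rstrip("\x00")

print(find_barcode(SAMPLE))
```

On the sample bytes this prints `@52082600910706111811412134550989`. Under this reading, the `d` in front of the barcode in the desired output is the low byte of the length field (0x64 = 100) rather than part of the barcode itself; that is an inference from the dump, not a confirmed fact about the format.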
sapht
  • I ran the code in ptkdb (a Perl debugger) and can see it doing something, but I never get any printed output. What is the code doing, and what output should I expect? Thanks! – blunders May 10 '11 at 15:45
  • @blunders Made a change -- it should work now. The expected output is "text/ascii affymetrix-array-barcode d@52082600910706111811412134550989" – sapht May 10 '11 at 17:09
  • +1 @sapht: Thanks, got it working. Though it's not clear whether the code will be adaptable to my needs: it appears to target the extraction based on position, which will not be the same from file to file. The only thing that I know will stay the same is the hex for "affymetrix-array-barcode" and the general formatting of the barcode itself. I guess I just figured the ASCII content was stored as ASCII and that I could use a regex to target the hex supplied, extract it, convert it to ASCII, and clean it up if needed. – blunders May 10 '11 at 17:42
0

A Perl solution might be interesting, but wouldn't the Unix `strings` command give you the plaintext portion of the file?
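By default `strings` looks for runs of single-byte printable characters, so text stored with a NUL next to every character can be missed. GNU `strings` has an `-e` option for selecting 16-bit encodings (see `man strings`). Below is a minimal Python sketch of that kind of scan, assuming 16-bit big-endian text as the dump in the question suggests. `SAMPLE`, `WIDE_RUN`, and `wide_strings` are names invented for the example.

```python
import re

def be(text):
    """Encode text as 16-bit big-endian, the layout the dump appears to use."""
    return text.encode("utf-16-be")

# Rebuild the 208 bytes shown in the question's dump (file offsets 0x3a40-0x3b0f).
SAMPLE = (
    b"1e38" + bytes(22) + b"\x0a" + be("text/ascii")
    + b"\x00\x00\x00\x18" + be("affymetrix-array-barcode")
    + b"\x00\x00\x00\x64" + be("@52082600910706111811412134550989")
    + bytes(34) + b"\x00\x00\x00\x0a\x00"
)

# At least four consecutive (NUL, printable ASCII) byte pairs.
WIDE_RUN = re.compile(rb"(?:\x00[\x20-\x7e]){4,}")

def wide_strings(data):
    """Return every run of 16-bit big-endian printable ASCII found in data."""
    return [m.group().decode("utf-16-be") for m in WIDE_RUN.finditer(data)]

print(" ".join(wide_strings(SAMPLE)))
```

On the sample bytes this prints `text/ascii affymetrix-array-barcode d@52082600910706111811412134550989`, which matches the desired output in the question apart from spacing. For little-endian data the pattern would need the printable byte first and the NUL second.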

pavium
  • @pavium: Code is running on CentOS, if you're able to call a system command and extract the output provided from the input list with perl -- yes, that's fine. That said, I have very limited understanding of Perl, and have never used the "strings" command; meaning currently your answer reads more like a comment for me. Thanks! – blunders May 10 '11 at 14:10
  • @pavium: So, strings does not appear to work, it's not seeing the data as a string. Meaning I ran the following command, "strings [ABSOLUTE PATH] >> [ABSOLUTE PATH]/string_output.txt" and the strings above were not present in the output; which is not to say there was not output. – blunders May 10 '11 at 14:31
  • @blunders, `man strings` should tell you more than you want to know about the command. I wouldn't want to pontificate on *how to do it* because it's very late here and I'd probably give you bad advice. Tomorrow, as a challenge, I might consider how to do it all in perl when I've had a good rest. – pavium May 10 '11 at 14:36
  • @pavium: Great, thank you! I've been looking at the man page, if I figure out a solution I'll comment again, again thanks!! – blunders May 10 '11 at 14:39
  • @blunders, the [ABSOLUTE PATH] you mentioned *was* the path to the binary file? I ask because in [ABSOLUTE PATH]/string_output.txt it should be a path to the directory *containing* the binary file. – pavium May 10 '11 at 14:41
  • +1 @pavium: That's correct, [ABSOLUTE PATH] was the directory path in the second reference. The strings command processed the file; it's just that it's not reading the data I need as a string, meaning it is picking up other strings which I know are from the file. Again, thanks! – blunders May 10 '11 at 15:21
  • @blunders, I upvoted the question because it seemed 'clear and useful' but someone downvoted it. Similarly, someone upvoted my answer soon after I posted it, but that's been downvoted too - probably because it doesn't actually provide a solution using Perl. I'm losing the enthusiasm I had last night ... so I probably won't rush to try a Perl solution, especially now you seem to have a working solution from sapht. – pavium May 10 '11 at 23:15
  • @pavium: Hmm, okay - no big deal, I posted a follow-up question, since my question didn't seem to be clear. If you do find an answer, I'd likely accept it as the answer for both questions. Cheers! http://stackoverflow.com/questions/5954784/extracting-strings-from-binary-file-using-regex-and-converting-to-ascii-using – blunders May 10 '11 at 23:23