2

I am trying to read the contents of a file into some readable form. I am using a FileInputStream to read the file into a byte array, and then trying to convert that byte array into a String.

So far, I have tried 3 different ways:

FileInputStream inputStream = new FileInputStream(file);
byte[] clearTextBytes = new byte[(int) file.length()];
inputStream.read(clearTextBytes);

String s = IOUtils.toString(inputStream); //first way

String str = new String(clearTextBytes, "UTF-8"); //second way

String string = Arrays.toString(clearTextBytes); //third way
String[] byteValue = string.substring(1, string.length() - 1).split(",");
byte[] bytes = new byte[byteValue.length];
for(int i=0, len=bytes.length; i<len; i++){
   bytes[i] = Byte.parseByte(byteValue[i].trim());
}
String newStr = new String(bytes);

When I print out each of the Strings: 1) prints out nothing, and 2 & 3) print out a lot of weird characters, such as: PK!�Q���[Content_Types].xml �(���MO�@��&��f��]���pP<*���v �ݏ�,_��i�I�(zi�N��}fڝ���h�5)�&��6Sf����c|�"�d��R�d���Eo�r�� �l�������:0Tɭ�"Э�p'䧘��tn��&� q(=X����!.���,�_�WF�L8W......

I would love any advice on how to properly convert my byte array to a String.

Kevin Donahoe
  • check this link http://stackoverflow.com/questions/88838/how-to-convert-strings-to-and-from-utf8-byte-arrays-in-java – AddyProg Dec 01 '15 at 13:13
  • Are you certain that it contains actual String data, i.e. it is the contents of a string written to the file? – Andy Turner Dec 01 '15 at 13:14
  • 7
    I'd guess your byte array does not contain a string in the first place. From the look of things you have given I'd say that's a Word document, not a txt. For reading the contents of a Word document you'd need some library like Apache POI – Jan Dec 01 '15 at 13:14
  • 3
    Are you sure the file is not a zip file ? Typically this happens when you try to read directly from a zip file and do not unzip it. – StackFlowed Dec 01 '15 at 13:16
  • 2
    I'd guess that "first way" doesn't print anything because you've already read everything from `inputStream` into `clearTextBytes`, so there are no more bytes to read. – Andy Turner Dec 01 '15 at 13:16
  • 1
    @StackFlowed ... and the file starts `PK` ;) – Peter Lawrey Dec 01 '15 at 13:22
  • Yes @Jan you're correct, my byte array contains contents of a Word document. To give a more complete idea of what I'm doing, I'm encrypting a file, and then decrypting it. So I read in the file using a FileInputStream, then encrypt that byte array, and then use a FileOutputStream to write out the encrypted file. That all seems to be working fine (since I wouldn't be able to read the encrypted text anyway). – Kevin Donahoe Dec 01 '15 at 13:25
  • However, when I decrypt the file, I try to do the same process. Read in the encrypted file using FileInputStream, decrypt the bytes, and write out to a new file using FileOutputStream. – Kevin Donahoe Dec 01 '15 at 13:25
  • However, this decrypted file is not actually decrypted, rather is still weird – Kevin Donahoe Dec 01 '15 at 13:25
  • characters: i.e, ó˘cá…ö&PìPÌ(b,fl∑ ∞ç ¬ [Content_Types].xml ¢ († ¥˛áfض]`å – Kevin Donahoe Dec 01 '15 at 13:26
  • 1
    but that might be decrypted zip or decrypted docx – Jan Dec 01 '15 at 13:26
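
Two points from the comments above can be illustrated in a few lines: a stream that has already been read to the end yields nothing on a second read (which is why the "first way" prints nothing), and a single `read(byte[])` call is not guaranteed to fill the array. The following is only a sketch; the class name `ReadFully` and the helper `readAll` are hypothetical, and a `ByteArrayInputStream` stands in for the `FileInputStream` from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadFully {
    // Hypothetical helper: keeps reading until end of stream, because one
    // read(byte[]) call may return fewer bytes than the buffer can hold.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the FileInputStream in the question (assumption).
        InputStream in = new ByteArrayInputStream(new byte[] {0x50, 0x4B, 0x03, 0x04});
        byte[] first = readAll(in);
        byte[] second = readAll(in); // the stream is already at its end
        System.out.println(first.length + " bytes, then " + second.length + " bytes");
    }
}
```

For an encrypt/decrypt round trip like the one described, keeping the data as `byte[]` from input to output avoids any character conversion along the way.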

4 Answers

4

As others have noted, the data doesn't look like it contains any text, so it is quite possibly binary data rather than text. Note that files which start with PK could be in PKZIP format, and the randomness of your data does suggest it could be compressed. http://www.garykessler.net/library/file_sigs.html Try renaming the file to have .ZIP at the end and see if you can open it in file explorer.

From the link above, the start of a DOCX file looks as follows.

50 4B 03 04 14 00 06 00 PK...... DOCX, PPTX, XLSX

Microsoft Office Open XML Format (OOXML) Document

NOTE: There is no subheader for MS OOXML files as there is with
DOC, PPT, and XLS files. To better understand the format of these files,
rename any OOXML file to have a .ZIP extension and then unZIP the file;
look at the resultant file named [Content_Types].xml to see the content
types. In particular, look for the <Override PartName= tag, where you
will find word, ppt, or xl, respectively.

Trailer: Look for 50 4B 05 06 (PK..) followed by 18 additional bytes
at the end of the file.
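
The signature check described above could be done in code before attempting any text conversion. This is a minimal sketch, not a full format detector: it only compares the first four bytes against `50 4B 03 04`, and the class name `ZipSignature` and method `looksLikeZip` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ZipSignature {
    // First four bytes of the signature listed above: 50 4B 03 04 ("PK" followed by 03 04).
    private static final byte[] PK_HEADER = {0x50, 0x4B, 0x03, 0x04};

    // Hypothetical helper: true if the data starts with the PKZIP local file header.
    static boolean looksLikeZip(byte[] data) {
        if (data == null || data.length < PK_HEADER.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, PK_HEADER.length), PK_HEADER);
    }

    public static void main(String[] args) {
        byte[] docxStart = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        byte[] plainText = "apples".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikeZip(docxStart)); // true
        System.out.println(looksLikeZip(plainText)); // false
    }
}
```

A match only says the file is probably a ZIP container; telling DOCX from PPTX or XLSX still requires looking inside it, as the note above describes.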

Assuming you do have text data, the character encoding is most likely neither your platform default nor UTF-8. You need to a) check what the encoding is, and b) check that the corruption is not introduced when you output the string rather than when you read the input.

You can try brute force to find a character set which doesn't produce any unknown characters.

public static Set<Charset> possibleCharsets(byte[] bytes) {
    Set<Charset> charsets = new LinkedHashSet<>();
    for (Charset charset : Charset.availableCharsets().values()) {
        if (!new String(bytes, charset).contains("�"))
            charsets.add(charset);
    }
    return charsets;
}
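
A stricter variant of the same idea, offered only as a sketch: instead of searching the decoded string for the replacement character, a `CharsetDecoder` can be asked to report malformed or unmappable input as an exception. The class name `StrictDecode` and method `decodesCleanly` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Hypothetical helper: true if every byte sequence is valid in the given charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] text = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] notUtf8 = {(byte) 0xFF, (byte) 0xFE}; // 0xFF never occurs in valid UTF-8
        System.out.println(decodesCleanly(text, StandardCharsets.UTF_8));    // true
        System.out.println(decodesCleanly(notUtf8, StandardCharsets.UTF_8)); // false
    }
}
```

Passing this check does not prove the data is text: single-byte charsets such as ISO-8859-1 accept every possible byte value, so binary data decodes "cleanly" there too.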
Peter Lawrey
  • 1
    Great - I have made it into a zip and have opened it. However, I'm a bit confused as to what you mean by checking what the encoding is and checking the corruption is not when I output the string instead of in the input. So is that possibleCharsets function of yours supposed to return all the sets of chars that don't include �, and then I create a new String out of that? Sorry I'm fairly new to bytes/binary data/ascii stuff. – Kevin Donahoe Dec 01 '15 at 13:45
  • (also, initially the Word document I was trying to read in was a simple .docx) – Kevin Donahoe Dec 01 '15 at 13:47
  • @KevinDonahoe there is nothing simple about the docx file format ;) You need a library designed to read such a document to have any chance of reading it. As it's a binary format, character encoding doesn't apply. – Peter Lawrey Dec 01 '15 at 15:35
0

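UTF-8 can encode more than a million different characters (1,112,064 code points); any that have no glyph in your font, and any bytes that are not valid UTF-8, are shown as a question mark. Try the classic DOS code page instead:

new String(clearTextBytes, "IBM437"); // Java's charset name for the US DOS code page (CP437)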
Grim
0

To get the text contents of a Word file, you need the Apache POI libraries:

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

[...]

   XWPFDocument docx = new XWPFDocument(new FileInputStream("file.docx"));       
   XWPFWordExtractor we = new XWPFWordExtractor(docx);
   System.out.println(we.getText());
Jan
0

I've written a very basic program to read the contents of a file and to print each string on a new line in the console. Here is the content of the file:

File1.txt

Here is the program I wrote:

import java.io.*;
import java.util.*;

class Test {
    public static void main(String args[]) throws FileNotFoundException {
        File file = new File("File1.txt");
        Scanner input = new Scanner(file);

        while (input.hasNext()) {
            System.out.println(input.next());
        }

        input.close();

    } // main()
} // class Test

This is the output to the console:

apples
pears
1
2
3
oranges
carrots
bananas
pineapples
sami