
I have a .doc file that contains extra header characters before the ÐÏ marker, so I need to remove all the characters that come before ÐÏ.

Example : asdfasdfasdfasfasdfasfÐÏ9asjdfkj

I have used the code below.

    InputStream is = new FileInputStream("D:\\Users\\Vinoth\\workspace\\Testing\\Testing_2.doc");
    DataInputStream dis = new DataInputStream(is);
    OutputStream os = new FileOutputStream("D:\\Users\\Vinoth\\workspace\\Testing\\Testing_3.doc");
    DataOutputStream dos = new DataOutputStream(os);
    byte[] buff = new byte[dis.available()];
    dis.readFully(buff);
    char temp = 0;
    boolean start = false;
    try {
        for (byte b : buff) {
            char c = (char) b;
            if (temp == 'Ð' && c == 'Ï') {
                start = true;
            }
            if (start) {
                dos.write(c);
            }
            temp = c;
        }

However, it does not write anything to my file, because the first if condition is never satisfied. Please advise how I can do this.

Vinoth
  • I just need to remove the characters before "ÐÏ"; the rest of the doc's content should stay the same. I tried simply reading and writing the file without any change, and the resulting doc file is fine. – Vinoth Jul 25 '16 at 15:59

1 Answer

The problem is the cast char c = (char)b;

Refer to byte-and-char-conversion-in-java, where you will see:

A character in Java is a Unicode code-unit which is treated as an unsigned number.

Take your case as an example. The byte holding the character 'Ï' has the binary representation 11001111. According to the Oracle tutorial:

byte: The byte data type is an 8-bit signed two's complement integer. It has a minimum value of -128 and a maximum value of 127 (inclusive).

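So the value of the byte is -49. When you cast it with (char)b, the negative value is sign-extended first, which gives the char 0xFFCF (65487) rather than 'Ï'. To compare it with 'Ï', 11001111 has to be interpreted as an unsigned byte, which is 207 (0x00CF).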

int i = b & 0xff; 

gives you the unsigned value (0 to 255) of that bit pattern.
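
To make the difference concrete, here is a small sketch that compares the plain cast with the masked one for the byte 0xCF. The class name ByteToCharDemo and the helper toUnsignedChar are made up for this illustration and are not part of the original code.

```java
final class ByteToCharDemo {
    // Hypothetical helper: treat the byte as unsigned (0..255) before widening to char.
    static char toUnsignedChar(byte b) {
        return (char) (b & 0xff);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xCF;               // bit pattern 11001111
        System.out.println(b);              // -49, because byte is signed
        System.out.println((int) (char) b); // 65487 (0xFFCF), the cast sign-extends first
        System.out.println(b & 0xff);       // 207 (0x00CF), the unsigned value
        System.out.println(toUnsignedChar(b) == '\u00CF'); // true
    }
}
```

The plain cast never produces a char in the range 128 to 255, which is why the comparison with 'Ï' in the question can never succeed.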

You can modify your code as shown below. To make debugging easier, I have changed the file paths and the file extension. I'm not sure whether the .doc format itself is an issue, but your code does have the bug I described above.

import java.io.*;

public class Test {
    public static void main(String[] args) {
        // try-with-resources closes both streams even if an exception is thrown
        try (DataInputStream dis = new DataInputStream(new FileInputStream("Testing_2.txt"));
             DataOutputStream dos = new DataOutputStream(new FileOutputStream("Testing_3.txt"))) {
            // available() is good enough for a small local file, but it is not
            // guaranteed to return the full file length
            byte[] buff = new byte[dis.available()];
            dis.readFully(buff);
            char temp = 0;
            boolean start = false;
            for (byte b : buff) {
                // Mask with 0xff so the byte is treated as unsigned (0..255)
                int i = b & 0xff;
                char c = (char) i;
                // '\u00D0' is 'Ð' and '\u00CF' is 'Ï'
                if (!start && temp == '\u00D0' && c == '\u00CF') {
                    start = true;
                    // 'Ð' was read before start was set, so write it now
                    dos.write(temp);
                }
                if (start) {
                    dos.write(c);
                }
                temp = c;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
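
As an alternative to converting each byte to a char, the marker can be searched for directly in the byte array. The following is only a sketch of that idea, not the code from the question: it assumes that ÐÏ in the file is the byte pair 0xD0 0xCF, and the names StripPrefix, indexOfMarker and stripBeforeMarker are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

final class StripPrefix {
    // Returns the index of the first 0xD0 0xCF pair, or -1 if it does not occur.
    static int indexOfMarker(byte[] data) {
        for (int i = 0; i + 1 < data.length; i++) {
            if ((data[i] & 0xff) == 0xD0 && (data[i + 1] & 0xff) == 0xCF) {
                return i;
            }
        }
        return -1;
    }

    // Returns a copy of data starting at the marker, or data unchanged if there is no marker.
    static byte[] stripBeforeMarker(byte[] data) {
        int start = indexOfMarker(data);
        return start < 0 ? data : Arrays.copyOfRange(data, start, data.length);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java StripPrefix <input> <output>");
            return;
        }
        // Read the whole file, drop everything before the marker, write the rest.
        byte[] input = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), stripBeforeMarker(input));
    }
}
```

It could be run as java StripPrefix Testing_2.doc Testing_3.doc. Working on bytes throughout avoids the char conversion entirely, and Files.readAllBytes does not depend on available() returning the full file size.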
Eugene