0

I am trying to find a specific word in a list of files. These files can be ASCII, Unicode, or some other format. So far I can only handle ASCII files. Is there any way to do the same operation on files with other encodings?

Scanner s = null;

        try {

            s = new Scanner(new BufferedReader(new FileReader("C:\\New Microsoft Word Document.docx")));

            while (s.hasNext()) {
//               final String lineFromFile = s.nextLine();
//              if(lineFromFile.contains("DE")){
                    System.out.println(s.next());
//                    break;
//              }

            }
        } finally {
            if (s != null) {
                s.close();
            }
        }

I get the following results:

Q[µM¡°‰”Ø÷Þ3{:½¹®’)xTÖä¬?µXFÚB™QÎÞ‡Ïé=K0SˆÊÈÙ?õº×W?áÂ&¤6˜³qî?s”cÐ3ëÀÐJi½?^ýˆ;!¿Äøm«uÇ¥5LHCô`ÝΔbR…¤?§Ï+gF,y\í‹Q9S:êãw~Pá¡Â=‰p®RRª?OM±Ç•®™2R.÷àX9¼!ð#
qe—i;`­{¥fzU@2>¼Mä|f}Á
+'šªÎNÛ
user2614607

2 Answers

0

docx is not a text format with a different encoding; it's a completely different, non-text file format. Basically, it's a zip archive of various files and folders (with the main data in some XML files). You can't just read it as a text file. You need to use a library such as Apache POI, or some kind of file converter, to obtain the text from it.

0

This has nothing to do with a different text encoding.

docx is a special format from Microsoft which holds various information about a document (packed as a zip archive).

You could open the file with Java's ZipFile and get the entry word/document.xml, which contains the text of the Word document. You can then read through this entry and output specific lines.

Pseudocode:

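// Needs java.util.zip.ZipFile and java.io.InputStream; both calls can throw IOException.
ZipFile file = new ZipFile("doc.docx");
InputStream input = file.getInputStream(file.getEntry("word/document.xml"));

input now provides the XML that holds the document's text.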

EDIT: document.xml contains the text of the document, but it is wrapped in many XML tags which you would have to filter out.
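
As a rough sketch of the ZipFile approach, with the tag filtering left to the JDK's XML parser instead of done by hand: the class name DocxText and the method extract are made up for illustration, and it assumes the text lives in w:t elements grouped under w:p paragraph elements in the WordprocessingML namespace. It ignores headers, footers and footnotes (they sit in other zip entries), and text boxes nested inside a paragraph may be counted twice.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DocxText {
    // Namespace of the w: prefix used in word/document.xml.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text of a .docx file, one paragraph per line. */
    static String extract(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + path);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true); // needed for the NS-aware lookups below
                Document doc = factory.newDocumentBuilder().parse(in);

                StringBuilder text = new StringBuilder();
                NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
                for (int i = 0; i < paragraphs.getLength(); i++) {
                    // Concatenate every text run (w:t) inside this paragraph.
                    NodeList runs = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
                    for (int j = 0; j < runs.getLength(); j++) {
                        text.append(runs.item(j).getTextContent());
                    }
                    text.append('\n');
                }
                return text.toString();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: DocxText <file.docx> <word>");
            return;
        }
        // Print only the paragraphs that contain the search word.
        for (String line : extract(args[0]).split("\n")) {
            if (line.contains(args[1])) {
                System.out.println(line);
            }
        }
    }
}
```

For anything beyond a plain word search (tables, headers, older .doc files), a library such as Apache POI is probably the safer route.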

maxammann
  • It's clear docx is just one kind of file, but I also have other Unicode files. Do you know of any API for Unicode files? – user2614607 Dec 09 '13 at 15:37
  • Java is able to read those :D Maybe you want to read this: http://stackoverflow.com/questions/3441490/whats-the-difference-between-an-encoding-a-character-set-and-a-code-page – maxammann Dec 09 '13 at 17:26
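
On the plain-text Unicode files raised in the comments: FileReader in the question decodes with the JVM's default charset, so a minimal sketch (assuming you know, or can guess, each file's encoding) is to pass the charset explicitly. The class name CharsetSearch and the method containsWord are hypothetical names for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class CharsetSearch {
    /** Returns true if any line of the file, decoded with the given charset, contains the word. */
    static boolean containsWord(Path file, Charset charset, String word) throws IOException {
        // Files.newBufferedReader decodes bytes to chars using the charset passed in,
        // rather than whatever the platform default happens to be.
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(word)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: CharsetSearch <file> <charset, e.g. UTF-8 or UTF-16> <word>");
            return;
        }
        System.out.println(containsWord(Paths.get(args[0]), Charset.forName(args[1]), args[2]));
    }
}
```

Note that Files.newBufferedReader throws a MalformedInputException when the bytes are not valid in the chosen charset, which at least makes a wrong guess visible instead of silently printing garbage.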