
I'm trying to work on a folder full of downloaded XML (HTML) files. For now, the files are .txt files with Hebrew in them, which I can see when I open the files.

When I try to read a file into a string to work on it, all of the Hebrew becomes gibberish. Any ideas?

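String fileSource = "path/path";
File folder = new File(fileSource);
File[] listOfFiles = folder.listFiles();
for (File currentFile : listOfFiles) {
    try {
        // Decodes the file's bytes as UTF-8 (FileUtils is from Apache Commons IO)
        String content = FileUtils.readFileToString(currentFile, "UTF-8");
        // ... work on content ...
    } catch (IOException e) {
        e.printStackTrace();
    }
}
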
If I save the files as UTF-8 first, it works, but I have so many files like that to work with.

Bar Hoshen

1 Answer


I'm going to assume that when you open a file, your browser or text editor decodes it as ISO-8859-8. Saving it as UTF-8 re-encodes the text, which is why your code above then works.

Therefore, your code needs to open the file with the same encoding as your browser or text editor.

Try

FileUtils.readFileToString(currentFile , "ISO-8859-8");
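To see why the charset argument matters, here is a minimal sketch that uses only the JDK. It assumes, purely for illustration, that the text was saved in windows-1255; the class name `DecodeDemo` and the sample word are hypothetical. It decodes the same bytes twice, once with the charset they were written in and once as UTF-8.

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // Decode raw file bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String hebrew = "\u05E9\u05DC\u05D5\u05DD"; // the word "shalom"
        // Simulate the bytes of a file saved in windows-1255 (an assumption).
        byte[] raw = hebrew.getBytes(Charset.forName("windows-1255"));

        String right = decode(raw, "windows-1255"); // same charset as the writer
        String wrong = decode(raw, "UTF-8");        // mismatched charset

        System.out.println("matching charset round-trips: " + right.equals(hebrew));
        System.out.println("UTF-8 decode round-trips: " + wrong.equals(hebrew));
    }
}
```

With a mismatched charset the bytes are not valid UTF-8, so the decoder substitutes replacement characters, which is one form the gibberish can take.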

EDIT:

Since we don't know the encoding your file uses, we can also try Windows-1255:

FileUtils.readFileToString(currentFile , Charset.forName("cp1255"));

According to Wikipedia, this appears to be the most common encoding for Hebrew:

Windows-1255 Hebrew is always in logical order (as opposed to visual). Microsoft Hebrew products (Windows, Office and Internet Explorer) brought logically-ordered Hebrew to common use, with the result that Windows-1255 is the Hebrew encoding that can be found most on the Web, having ousted the visually ordered ISO-8859-8, and preferred to the logically ordered ISO-8859-8-I because it provides for vowel-points.
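If the folder might mix UTF-8 files with legacy-encoded ones, one possible approach is to attempt a strict UTF-8 decode first and fall back only when it fails. The sketch below is an assumption-laden illustration, not a full detector: `CharsetGuess`, `tryDecode`, and `decodeWithFallback` are hypothetical names, the fallback charset is whatever you pass in, and UTF-16 is not handled.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetGuess {
    // Returns the decoded text, or null if the bytes are not valid in this charset.
    static String tryDecode(byte[] raw, Charset cs) {
        try {
            return cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Try strict UTF-8 first, then fall back to the assumed legacy charset.
    static String decodeWithFallback(byte[] raw, Charset fallback) {
        String text = tryDecode(raw, StandardCharsets.UTF_8);
        return text != null ? text : new String(raw, fallback);
    }
}
```

The bytes could come from `FileUtils.readFileToByteArray(currentFile)`, with `Charset.forName("windows-1255")` as the fallback. Single-byte charsets accept almost any byte sequence, so the fallback cannot confirm that the guess is right; it only rules out UTF-8.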

Martin Konecny
  • Perhaps it is `UTF-16` then. Try that in your original code. If that doesn't work either, see this link: http://stackoverflow.com/a/4735911/276949. It will heuristically detect the encoding you are using. – Martin Konecny May 02 '15 at 23:24
  • @BarHoshen, what do you do with `content` after you've read the file? Are you sure that your post-processing is using the correct charset -- and what is the default charset of your JVM? – Mick Mnemonic May 02 '15 at 23:32
  • UTF-8 is the charset, and even when I just print it with System.out.println it comes out wrong. I need to go over the content and extract some info. By the way, I solved it with the help of a friend, using VB code that converts every file in a folder to UTF-8. Still, I believe something is wrong in my approach here and I'm trying to figure it out. – Bar Hoshen May 03 '15 at 01:31
  • It doesn't make sense to me that after you re-save the file as UTF-8 it begins working. To me that implies that the original file is not actually UTF-8. If you post the before/after files somewhere, I can take a quick look. – Martin Konecny May 03 '15 at 06:33
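The batch-conversion workaround mentioned in the comments can also be done in Java. This is a rough sketch, not the VB code that was actually used: the class `FolderToUtf8`, the method `convertAll`, the `*.txt` filter, and the choice of source charset are all assumptions. It writes converted copies into a separate directory so the originals stay untouched.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

class FolderToUtf8 {
    // Re-encode every .txt file in sourceDir from sourceCharset to UTF-8,
    // writing the copies into targetDir. Returns the number of files converted.
    static int convertAll(Path sourceDir, Path targetDir, Charset sourceCharset) throws IOException {
        Files.createDirectories(targetDir);
        int converted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(sourceDir, "*.txt")) {
            for (Path file : files) {
                // Decode with the assumed source charset, then encode as UTF-8.
                String text = new String(Files.readAllBytes(file), sourceCharset);
                Files.write(targetDir.resolve(file.getFileName()),
                            text.getBytes(StandardCharsets.UTF_8));
                converted++;
            }
        }
        return converted;
    }
}
```

After conversion, the original `readFileToString(currentFile, "UTF-8")` call should work on the copies, provided the source charset guess was correct.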