I use Java to read a list of files. Some of them have a different encoding: ANSI instead of UTF-8. java.util.Scanner is unable to read these files, and I get an empty output string. I tried another approach:

                FileInputStream fis = new FileInputStream(my_file);
                BufferedReader br = new BufferedReader(new InputStreamReader(fis));
                InputStreamReader isr = new InputStreamReader(fis);
                isr.getEncoding();

I am not sure how to change the character encoding for the ANSI files. UTF-8 and ANSI files are mixed in the same folder. I tried to use Apache Tika to detect the encoding. After I get the encoding of a file, I pass it to Scanner, but I still get empty output.

Scanner scanner = new Scanner(my_file, detector.getCharset().toString());
line = scanner.nextLine();
plaidshirt
  • @Nirekin : In this case I have UTF-8 and ANSI encoded mixed, so unable to set a fixed solution. – plaidshirt Nov 06 '18 at 12:20
  • Possible duplicate of [Java : How to determine the correct charset encoding of a stream](https://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream) – locus2k Nov 06 '18 at 13:17
  • @locus2k : I see, but how to use detected charset in Scanner? – plaidshirt Nov 06 '18 at 13:23
  • Open an input stream with the desired charset and if it fails try the next one until it works. The link has some solutions. – locus2k Nov 06 '18 at 13:29
  • @locus2k : I get charset, and set it in Scanner too, but output string is empty. – plaidshirt Nov 06 '18 at 13:51
  • Please add a code snippet showing how you're using the scanner after you initialize it. – locus2k Nov 06 '18 at 13:55
  • try to set the detected encoding on file open: `String fileName = getFileNameToReadFromUserInput(); FileInputStream is = new FileInputStream(fileName); InputStreamReader isr = new InputStreamReader(is, getCorrectCharsetToApply()); BufferedReader buffReader = new BufferedReader(isr);` – Marc Stroebel Nov 06 '18 at 14:01
  • @MarcStröbel : I set it in Scanner. – plaidshirt Nov 06 '18 at 14:03
  • @locus2k : I added it. – plaidshirt Nov 06 '18 at 14:03
  • Make sure your files are not empty at the top, then put your scanner in a while loop: `while(scanner.hasNextLine()) line = scanner.nextLine();` If that doesn't work and all you are doing is reading lines, you can try the normal buffered reader way. – locus2k Nov 06 '18 at 14:07
  • @locus2k : I see, but as I understand, Scanner should work too in this way. – plaidshirt Nov 06 '18 at 15:27
  • Are you (or the application) in full control of these files or do they come from external sources? If you're in full control of the files then one solution is to include their encoding in the file name. You'd then parse the filename to get the encoding before opening the `Scanner`. – Slaw Nov 14 '18 at 16:11
  • @Slaw : No, these files come from external source. – plaidshirt Nov 15 '18 at 12:28

3 Answers

There is a library called juniversalchardet, which can help you at guessing the right encoding. It was updated recently and is currently located on GitHub:

https://github.com/albfernandez/juniversalchardet

However, there is no fail-safe tool to detect encodings, as there are many things unknown:

  1. Is this file text at all or some PNG?
  2. Is it stored in a (1,...,k,...,n)-bit encoding?
  3. Which k-bit encoding was used?

Some guesswork can be done by counting the number of control characters that are not commonly used in text. When a decoded file contains many control characters, it is likely that you've chosen the wrong encoding. (Then try the next one.)
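The counting idea can be sketched as follows. This is only an illustration, not how juniversalchardet works internally: the class and method names are hypothetical, and the candidate list (UTF-8 and ISO-8859-1 in the usage note) is an assumption. The sketch decodes the same bytes with each candidate and counts control characters plus the replacement character U+FFFD that Java inserts for undecodable bytes.

```java
import java.nio.charset.Charset;
import java.util.List;

public class ControlCharHeuristic {

    /** Counts characters that rarely appear in plain text after decoding. */
    static long suspiciousChars(byte[] bytes, Charset charset) {
        // new String(...) replaces malformed input with U+FFFD instead of failing
        String text = new String(bytes, charset);
        return text.chars()
                .filter(c -> c == 0xFFFD
                        || (Character.isISOControl(c) && c != '\n' && c != '\r' && c != '\t'))
                .count();
    }

    /** Returns the candidate with the fewest suspicious characters (first one wins on ties). */
    static Charset guess(byte[] bytes, List<Charset> candidates) {
        Charset best = candidates.get(0);
        long bestScore = Long.MAX_VALUE;
        for (Charset cs : candidates) {
            long score = suspiciousChars(bytes, cs);
            if (score < bestScore) {
                best = cs;
                bestScore = score;
            }
        }
        return best;
    }
}
```

A call such as `guess(Files.readAllBytes(path), Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1))` would pick one of the two. The heuristic is weak for files that are mostly ASCII, since both candidates then score the same.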

Juniversalchardet uses several more reliable techniques to determine encodings (even Chinese ones). It also provides convenient ways to open a reader from a file with the correct encoding already selected:

(Snippet taken from https://github.com/albfernandez/juniversalchardet#creating-a-reader-with-correct-encoding and adapted)

import org.mozilla.universalchardet.ReaderFactory;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;

public class TestCreateReaderFromFile {

    public static void main (String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TestCreateReaderFromFile FILENAME");
            System.exit(1);
        }

        File file = new File(args[0]);
        // The variable must be a BufferedReader (not a plain Reader) for readLine();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = ReaderFactory.createBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // Print each line to console
            }
        }
    }

}

Edit: Added ScannerFactory

/*
(C) Copyright 2016-2017 Alberto Fernández <infjaf@gmail.com>
Adapted by Fritz Windisch 2018-11-15
The contents of this file are subject to the Mozilla Public License Version
1.1 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.mozilla.org/MPL/
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
for the specific language governing rights and limitations under the
License.
Alternatively, the contents of this file may be used under the terms of
either the GNU General Public License Version 2 or later (the "GPL"), or
the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
in which case the provisions of the GPL or the LGPL are applicable instead
of those above. If you wish to allow use of your version of this file only
under the terms of either the GPL or the LGPL, and not to allow others to
use your version of this file under the terms of the MPL, indicate your
decision by deleting the provisions above and replace them with the notice
and other provisions required by the GPL or the LGPL. If you do not delete
the provisions above, a recipient may use your version of this file under
the terms of any one of the MPL, the GPL or the LGPL.
*/

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Scanner;
import org.mozilla.universalchardet.UniversalDetector;
import org.mozilla.universalchardet.UnicodeBOMInputStream;

/**
 * Create a scanner from a file with correct encoding
 */
public final class ScannerFactory {

    private ScannerFactory() {
        throw new AssertionError("No instances allowed");
    }
    /**
     * Create a scanner from a file with correct encoding
     * @param file The file to read from
     * @param defaultCharset defaultCharset to use if can't be determined
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */

    public static Scanner createScanner(File file, Charset defaultCharset) throws IOException {
        Charset cs = Objects.requireNonNull(defaultCharset, "defaultCharset must be not null");
        String detectedEncoding = UniversalDetector.detectCharset(file);
        if (detectedEncoding != null) {
            cs = Charset.forName(detectedEncoding);
        }
        if (!cs.toString().contains("UTF")) {
            return new Scanner(file, cs.name());
        }
        Path path = file.toPath();
        return new Scanner(new UnicodeBOMInputStream(new BufferedInputStream(Files.newInputStream(path))), cs.name());
    }
    /**
     * Create a scanner from a file with correct encoding. If charset cannot be determined,
     * it uses the system default charset.
     * @param file The file to read from
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */
    public static Scanner createScanner(File file) throws IOException {
        return createScanner(file, Charset.defaultCharset());
    }
}
Friwi

Your approach will not give you the right encoding.

 FileInputStream fis = new FileInputStream(my_file);
 BufferedReader br = new BufferedReader(new InputStreamReader(fis));
 InputStreamReader isr = new InputStreamReader(fis);
 isr.getEncoding();

This returns the encoding used by the InputStreamReader (see the Javadoc), which is the platform default when none is specified, and not the encoding of the characters written in the file (my_file in your case). If the encoding is wrong, Scanner won't be able to read the file properly.
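A small sketch to illustrate the point, using an in-memory stream in place of a file (the class and method names are hypothetical): whatever bytes are passed in, the reader reports the charset it was configured with, which is the platform default when none is given.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class GetEncodingDemo {

    /** Returns the charset the reader was set up with; the content is never inspected. */
    static Charset readerCharset(byte[] content) throws IOException {
        try (InputStreamReader isr = new InputStreamReader(new ByteArrayInputStream(content))) {
            // getEncoding() may return a historical name such as "UTF8" or "Cp1252";
            // Charset.forName resolves it to the canonical charset.
            return Charset.forName(isr.getEncoding());
        }
    }
}
```

Calling `readerCharset` with UTF-16 bytes and with ISO-8859-1 bytes gives the same answer on the same machine, so the value says nothing about the file.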

In fact (do correct me if I am wrong), there is no way to determine the encoding of a particular file with 100% accuracy. A few projects have a better success rate at guessing the encoding, but none of them reaches 100% accuracy. On the other hand, if you know the encoding, you can read the file using:

Scanner scanner = new Scanner(my_file, "charset");
scanner.nextLine();

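Also, find out the correct charset name Java uses for ANSI. "ANSI" usually means the Windows code page of the system that wrote the file, so it is typically Cp1252 (windows-1252) for Western European text or Cp1251 for Cyrillic, rather than US-ASCII.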

Whichever path you take, be on the lookout for any IOException, which might point you in the right direction.
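One caveat when watching for exceptions: Scanner does not throw an IOException from hasNextLine() or nextLine(). It records the last one, and Scanner.ioException() returns it, so it is worth checking that method when the output is unexpectedly empty. The sketch below avoids the problem by decoding the bytes itself. It assumes only two candidates, UTF-8 and windows-1252 as a stand-in for "ANSI" (that choice is an assumption, as are the class and method names): strict UTF-8 decoding is tried first, and the fallback is used only if the bytes are not valid UTF-8.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

public class StrictDecodeFallback {

    // Assumption: "ANSI" means windows-1252 here; adjust for other code pages.
    static final Charset ANSI_FALLBACK = Charset.forName("windows-1252");

    /** Decodes as UTF-8 if the bytes are valid UTF-8, otherwise with the fallback charset. */
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: a single-byte charset can decode any byte sequence
            return new String(bytes, ANSI_FALLBACK);
        }
    }

    /** Hypothetical helper: a Scanner over the already decoded file content. */
    static Scanner openScanner(Path file) throws IOException {
        return new Scanner(decode(Files.readAllBytes(file)));
    }
}
```

This reads each file fully into memory, which is fine for small key: value files but not for very large ones. A UTF-8 byte order mark, if present, stays in the decoded text as the first character.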

Vicky Singh
  • I tried Cp1252 and Cp1251 for these files, but string isn't present in output. – plaidshirt Nov 12 '18 at 07:35
  • @plaidshirt Can you share sample text that you are trying to read along with the code ? – Vicky Singh Nov 12 '18 at 09:33
  • It doesn't depend on the text, because it is formatted in the same way every time (key: value). The only difference between these files is the type of encoding. – plaidshirt Nov 12 '18 at 10:45
  • As said in the answer, you cannot get an encoding by just looking at the file. If you know the expected outcome, because you said it is formatted the same way, you could try one encoding, see if it fits your pattern, and do this for the other encodings too. This is costly and would scale extremely badly, but may be done for a small number of files. But is this the right approach? Maybe try to look from a different angle and figure out how you can get the data in another encoding. – sbstnzmr Nov 14 '18 at 14:07
  • @sezi80 : Encoding of files is predefined, so there isn't any other option to solve this. – plaidshirt Nov 15 '18 at 12:26
To make Scanner work with different encodings, you have to pass the correct one to the Scanner constructor.

To determine the file encoding, it is better to use an external library (e.g. https://github.com/albfernandez/juniversalchardet). But if you know the possible encodings for certain, you can check manually, according to Wikipedia:

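import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedList;
import java.util.List;
import java.util.Scanner;

public static void main(String... args) throws IOException {
    List<String> lines = readLinesFromFile(new File("d:/utf8.txt"));
}

public static List<String> readLinesFromFile(File file) throws IOException {
    try (Scanner scan = new Scanner(file, getCharsetName(file))) {
        List<String> lines = new LinkedList<>();

        // hasNextLine() (not hasNext()) is the matching check for nextLine()
        while (scan.hasNextLine())
            lines.add(scan.nextLine());

        return lines;
    }
}

private static String getCharsetName(File file) throws IOException {
    try (InputStream in = new FileInputStream(file)) {
        // A UTF-8 byte order mark is the byte sequence EF BB BF
        if (in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF)
            return StandardCharsets.UTF_8.name();
        // Only pure ASCII decodes with this fallback; bytes above 0x7F need another charset
        return StandardCharsets.US_ASCII.name();
    }
}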
Oleg Cherednik