What is the proper encoding to use with item Reader

Question

I'm using Spring Batch to read CSV files. When I open these files in Notepad++, it reports their encoding as ANSI. When reading a line from a file, I notice that accented characters are not shown correctly. For example, take this line:

Données issues de la reprise des données

After reading, the accented characters are replaced by special characters.

As a first solution, I set the encoding of my item reader to UTF-8, but the problem still exists.

I thought that with UTF-8 encoding all my accented characters would be recognized. Is that not true? From what I have heard, UTF-8 is the best encoding for handling all characters, on a web page for example.

After setting my item Reader encoding to ISO-8859-1:

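import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.core.io.FileSystemResource;

public class TestItemReader extends FlatFileItemReader<TestFileRow> {
    private static final Logger log = LoggerFactory.getLogger(TestItemReader.class);

    public TestItemReader(String path) {
        this.setResource(new FileSystemResource(path + "/Test.csv"));
        // Decode the file as ISO-8859-1 rather than the reader's default charset.
        this.setEncoding("ISO-8859-1");
        // ... remainder of the constructor omitted in the original post
    }
}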

I can see that these characters are now displayed correctly.

My output should be written as UTF-8. Is it correct to use ISO-8859-1 as the input encoding and UTF-8 as the output encoding?

"My question is that why when i try to set the itemReader encoding to utf-8 still persist ?" - Um, because the file isn't in UTF-8. Its not clear what you're asking, to be honest. — Jon Skeet, Nov 15 '17 at 09:42
I suspect you don't understand how encodings work. If a file is encoded in ISO-8859-1 and you try to read it using UTF-8, it's a bit like trying to use a PNG reader to load a JPEG image. UTF-8 can represent every character in Unicode, but that doesn't mean you can arbitrarily use it for files that are encoded in a different encoding. — Jon Skeet, Nov 15 '17 at 09:48
You might want to read http://csharpindepth.com/Articles/General/Unicode.aspx - it's phrased in terms of C#, but the concepts are the same. — Jon Skeet, Nov 15 '17 at 09:49
OK, so there is no such thing as a universal encoding that can read any format. I should use the same encoding that Notepad++ reports. — Feres.o, Nov 15 '17 at 09:50
Well "ANSI" isn't a single encoding either. If you can change what's producing the CSV files to output UTF-8, that would be the best thing. But if you can't change that, you should find out what encoding it's using (without just relying on Notepad++). — Jon Skeet, Nov 15 '17 at 09:51
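As a follow-up to the comments above, one cheap way to learn something about a file's encoding without relying on an editor is to decode its bytes strictly as UTF-8 and see whether that fails. The sketch below uses only the JDK; the class name `Utf8Check` and the sample text are made up for illustration. A failure tells you the file is not UTF-8, but it cannot tell you which legacy encoding (Cp1252, ISO-8859-1, ...) was used instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of silently substituting a replacement character.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: the first word of the line from the question.
        String text = "Donn\u00e9es";

        // The same text, stored in two different encodings.
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(asUtf8));   // true
        System.out.println(isValidUtf8(asLatin1)); // false: a lone 0xE9 byte is not valid UTF-8
    }
}
```

For a real file you would pass in the result of `Files.readAllBytes(...)`. Note that a pure-ASCII file passes this check whatever encoding produced it, so the test is only informative when the file contains accented characters.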

PixelMaster · Answer 1 · 2023-08-23T17:20:42.627

5

I had the same problem. Input file is ANSI, and "ü" gets displayed as a square in the output.

That's because your input file is encoded in ANSI, but by default, Spring Batch assumes ISO-8859-1 encoding (6.6.2 FlatFileItemReader).
Update 2023: in newer versions, the default is UTF-8, but back when the question was posted, it was ISO-8859-1 instead, as verifiable by checking older versions of the linked document; for instance, version 4.0.1.RELEASE. I'm not sure which version was current back then, but the point remains the same either way.

Therefore, you have to set the encoding for your reader to "Cp1252" (setEncoding("Cp1252")). That is Java's name for Windows-1252, the code page that Notepad++ labels "ANSI" on Western European Windows systems.
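To see why the charset passed to the reader matters, here is a small JDK-only sketch that decodes the same Cp1252 bytes three ways. The class name `DecodeDemo` and the sample string (the question's word plus a euro sign) are my own assumptions, chosen because the euro sign is one of the characters where Cp1252 and ISO-8859-1 disagree. "windows-1252" is not one of the charsets every JVM is required to support, although standard JDK builds include it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    // Hypothetical sample text: an accented word followed by a euro sign.
    static final String SAMPLE = "Donn\u00e9es \u20ac";

    // Interprets raw bytes using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");

        // Simulate the bytes of an "ANSI" file as written on Western European Windows.
        byte[] fileBytes = SAMPLE.getBytes(cp1252);

        // Matching charset: the text round-trips exactly.
        System.out.println(decode(fileBytes, cp1252).equals(SAMPLE)); // true

        // ISO-8859-1: the accented letter survives, the euro sign does not.
        String latin1 = decode(fileBytes, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.startsWith("Donn\u00e9es")); // true
        System.out.println(latin1.indexOf('\u20ac') >= 0);     // false

        // UTF-8: the non-ASCII bytes are malformed and become replacement characters.
        String utf8 = decode(fileBytes, StandardCharsets.UTF_8);
        System.out.println(utf8.indexOf('\uFFFD') >= 0);       // true
    }
}
```

This also suggests why reading with ISO-8859-1 can appear to work: for common accented Latin letters the two encodings use the same byte values, and they only diverge for characters such as the euro sign or curly quotes.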

Furthermore, you will have to set your writer's encoding to "utf-8". I'm not entirely sure why it doesn't work with other encodings (that are generally able to display "ü", such as ISO-8859-1), but it works with UTF-8, so that's what I'm using.
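Reading in one encoding and writing in another is a legitimate combination: once the bytes are decoded, the text is just Unicode characters, and the writer's encoding is an independent choice. The following is a minimal JDK-only sketch of that idea, not Spring Batch code; the class `Transcode` and its method are hypothetical names, and it works on byte arrays only to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {

    // Decodes the input bytes with one charset and re-encodes the text with another.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer writer = new OutputStreamWriter(out, to)) {
            char[] buffer = new char[4096];
            int read;
            // Copy decoded characters; the writer encodes them on the way out.
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer has been closed (and therefore flushed) at this point.
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String line = "Donn\u00e9es issues de la reprise des donn\u00e9es";

        byte[] ansiBytes = line.getBytes(cp1252);
        byte[] utf8Bytes = transcode(ansiBytes, cp1252, StandardCharsets.UTF_8);

        // The text is unchanged; only its byte representation differs.
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8).equals(line)); // true
        // Each accented letter takes one byte in Cp1252 and two in UTF-8.
        System.out.println(utf8Bytes.length - ansiBytes.length); // 2
    }
}
```

In a Spring Batch job the same split would presumably be expressed by calling `setEncoding` on the reader with the input charset and on the `FlatFileItemWriter` with "UTF-8".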

edited Aug 23 '23 at 17:20

answered Sep 12 '18 at 17:52

PixelMaster

Default encoding in FlatFileItemReader is UTF-8 as per document you have shared. Please correct me if I'm wrong. – ankush__ Jul 20 '23 at 02:31
You are right @ankush__ , the default encoding set in AbstractFileItemWriter is utf-8 – Guardian Aug 23 '23 at 10:45
@ankush__ I have not worked with Spring Batch for years, but I think the default has simply changed since 2018 when I originally wrote the answer. Checking older versions of the linked document confirms this. For example, Spring Batch version 4.0.1 RELEASE shows default encoding ISO-8859-1: https://docs.spring.io/spring-batch/docs/4.0.1.RELEASE/reference/html/readersAndWriters.html#readersAndWriters – PixelMaster Aug 23 '23 at 17:17
Thanks. You are right that default encoding has changed. – ankush__ Aug 25 '23 at 04:10

score 0 · Answer 2 · answered Jun 01 '18 at 06:36

I use the same encoding, "ISO-8859-1", and all characters are displayed correctly.

zedtimi
