What happens if your input file contains some unsupported character?

Question

I have this text file which might contain some unsupported characters in the Latin1 character set, which is the default character set of my JVM.

What would those characters be turned into when my Java program tries to read from the file? Concretely, suppose I had a 2-byte character in the file: would it be read as a one-byte character (because each character in Latin1 is only 1 byte long)?

Thanks,

I'm having trouble setting the default character set of my JVM . (and was a bit afraid of messing it up!) — One Two Three, Jun 04 '12 at 22:50
If you read a file with one character encoding, using a different encoding, you are likely to simply get nonsense (e.g., reading an ISO 8859 file using UTF-8 encoding is likely to get some very weird characters). This is like reading a base 13 number using a base 7 radix. Don't expect sensible results. KNOW THY FILE ENCODING! — Ira Baxter, Jun 04 '12 at 22:53
I bet you’re wrong about what the default is. I know no platform that uses ISO-8859-1 as its default platform encoding. Macs use MacRoman. Microsoft uses Windows 1252 — which is ***not*** Latin1. Which system is this for? — tchrist, Jun 05 '12 at 03:05
I didn't say the default charset of the JVM was up to my platform. It's actually set to an 'agreed-upon' default value — One Two Three, Jun 05 '12 at 03:24

Stephen C · Answer 1 · 2012-06-05T02:34:03.090

I can't use the InputStreamReader option, because the file has to be read with Latin1.

And

I have this text file which might contain some unsupported characters in the Latin1 character set ...

You have contradictory requirements here.

Either the file is LATIN-1 (and there are no "unsupported characters") or it is not LATIN-1. If it is not LATIN-1, you should be trying to find out what character set / encoding it really is, and use that one instead of LATIN-1 to read the file.

As other answers / comments have explained, you can either change the JVM's default character set, or specify a character set explicitly when you open the Reader.

I'm having trouble setting the default character set of my JVM .

Please explain what you are trying and what problems you are having.

(and was a bit afraid of messing it up!)

COWARD! :-)

FWIW - if you try to read a data stream in (say) LATIN-1 and the data stream is not actually in LATIN-1, then you can expect the following:

Characters that encode the same in LATIN-1 and the actual character set will be passed unharmed.
Characters that don't encode the same will either be replaced by a character that means "unknown character" (e.g. a question mark), or will be garbled. Which of the two happens depends on whether the byte or byte sequence at issue encodes a valid (but wrong) character, or no character at all.

The net result will be partially garbled text. The garbling may or may not be reversible, depending on exactly what the real character set and characters are. But it is best to avoid "going there" ... by using the RIGHT character set to decode in the first instance.
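The two outcomes described above can be reproduced in a few lines. This is only a sketch: the class and method names are made up for illustration, and the sample word is an arbitrary choice. It relies on two things that hold for the standard Java charsets: the ISO-8859-1 decoder maps every possible byte to some character (so it garbles but never rejects input), while `new String(bytes, UTF_8)` substitutes U+FFFD for byte sequences that are not valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class WrongCharsetDemo {

    // UTF-8 bytes decoded as Latin-1: every byte becomes one (wrong) char.
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the garbling above: re-encode as Latin-1 to recover the
    // original bytes, then decode them with the charset they were written in.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    // Latin-1 bytes decoded as UTF-8: invalid sequences are replaced with
    // U+FFFD, so the original byte values are lost.
    static String latin1ReadAsUtf8(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "caf\u00e9"; // 4 chars; the last one is 2 bytes in UTF-8

        String garbled = utf8ReadAsLatin1(word);
        System.out.println(garbled.length());             // 5
        System.out.println(repair(garbled).equals(word)); // true

        String lost = latin1ReadAsUtf8(word);
        System.out.println(lost.indexOf('\uFFFD') >= 0);  // true
    }
}
```

The first case is reversible only because Latin-1 decoding preserves every byte value; once a decoder has substituted a replacement character, the original bytes cannot be recovered from the string.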

Those are not contradictory requirements. Either you misunderstood what I said, or I didn't put it as clearly as I thought I had. What I meant was `I have this text file which might contain some characters that are NOT supported in the Latin1 character set`. The text file comes from an arbitrary place, and I have no way of knowing what character set it was encoded with. — One Two Three, Jun 04 '12 at 23:15
@OneTwoThree So the file doesn't have 'to be read with Latin1'. It has to be read using its own charset, whatever that is. — user207421, Jun 04 '12 at 23:43
Well, OK. I guess I'll close this question. Thanks for all of your inputs. — One Two Three, Jun 05 '12 at 01:45
(I didn't misunderstand what you said. The problem is that you didn't say / write what you meant ... and I can't read your mind.) — Stephen C, Jun 05 '12 at 03:23

score 1 · Accepted Answer · answered Jun 04 '12 at 22:48

First of all you can specify the character set to use when reading a file. See for example: java.io.InputStreamReader

Secondly: yes, if you read using a single-byte character set, each byte is mapped to exactly one character.

Thirdly: test it and you will see beyond doubt what actually happens!
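A quick way to run that test is sketched below. The class name and the `readAll` helper are invented for this example, and a byte array stands in for the file so the snippet is self-contained. It passes the charset explicitly to `java.io.InputStreamReader` and reads the two bytes that encode U+00E9 in UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; a real program would wrap a FileInputStream instead.
class ExplicitCharsetRead {

    // Reads all characters from the bytes using the given charset.
    static String readAll(byte[] data, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+00E9 (e with acute accent) is two bytes.
        byte[] twoByteChar = {(byte) 0xC3, (byte) 0xA9};

        // Read as Latin-1: one character per byte.
        System.out.println(readAll(twoByteChar, StandardCharsets.ISO_8859_1).length()); // 2

        // Read as UTF-8: the two bytes form a single character.
        System.out.println(readAll(twoByteChar, StandardCharsets.UTF_8).length());      // 1
    }
}
```

The lengths are printed rather than the characters themselves because console output goes through yet another encoding, which could distort the result.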

Mattias Isegran Bergander


I can't use the InputStreamReader option, because the file has to be read with Latin1. So you're saying if I have a weird 2-byte Unicode character in the file, that character will end up being two characters upon being read? – One Two Three Jun 04 '12 at 22:52
If the file has to be read with Latin1, then the file has to contain Latin1 characters or they will not be read properly. – jahroy Jun 04 '12 at 22:59
@jahroy: Yes, I understand that the characters will not be read properly. But I'm trying to find out what the expected behavior would be. – One Two Three Jun 04 '12 at 23:16
Pulsar's answer seems to indicate that your unidentified character WILL be read as two characters. You probably need to know the original character set in order to turn them into something meaningful... http://www.joelonsoftware.com/articles/Unicode.html – jahroy Jun 05 '12 at 00:06

score 0 · Answer 3 · edited May 23 '17 at 12:12

If you don't know the charset, you'll have to guess it. This is tricky and error-prone.

Here is a question regarding this issue: How can I detect the encoding/codepage of a text file

Check out how you can fool Notepad into guessing wrong.
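One heuristic that is often used as a first step when guessing, sketched here under the assumption that UTF-8 is among the candidates: check whether the bytes are well-formed UTF-8 with a strict `CharsetDecoder`. The class and method names are hypothetical. A negative result rules UTF-8 out, but a positive result is only a hint; pure ASCII data, for example, is valid in many encodings at once.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; one heuristic only, not a full charset detector.
class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed input.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false: a lone 0xE9 is not valid UTF-8
    }
}
```

The reverse check is not possible for Latin-1: every byte sequence decodes as ISO-8859-1, so that decoder can never report that the guess was wrong.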

answered Jun 05 '12 at 02:45

Sarel Botha
