Working with special characters derived from a filename in a zip file

Question

This question concerns a Tomcat 7 web application, which is connected to a MySQL (5.5.16) database.

When I open a zip file whose filenames are encoded in the windows-1252 charset, the characters seem to be interpreted correctly by Java:

ZipFile zf = new ZipFile( zipFile, Charset.forName( "windows-1252" ) );
Enumeration<? extends ZipEntry> entries = zf.entries();
while( entries.hasMoreElements() ) {
    ZipEntry ze = entries.nextElement();
    if( ! ze.isDirectory() ) {
        String name = ze.getName();
        System.out.println( name ); //prints correct filenames, e.g. café.pdf
    }
}

Omitting the Charset object in the ZipFile constructor would cause an exception. The filenames in the zip file are printed correctly to standard output, including diacritics. But, when I subsequently try to store the filename in a database, the e-acute is replaced with a question mark (as seen with the mysql console client). I had no problems inserting special characters from the web application into MySQL before.

When I execute an INSERT with é in Java source code:

statement.executeUpdate( "insert into files (filename) values ('café.pdf')" );

the é is stored correctly in MySQL.

Also, my log file shows a comma-like character instead of é: caf‚.pdf

Does anyone know what could be happening here?

How do you open the connection to the MySQL server? What classes/libraries/services do you use? — hellodanylo, Jun 29 '12 at 10:43
To connect Java with MySQL I use a javax.sql.DataSource resource with: driverClassName="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/workflow?zeroDateTimeBehavior=convertToNull" — reus, Jun 29 '12 at 10:47
Please execute the following query from your Java app and show us the output: `show variables like 'char%'` — hellodanylo, Jun 29 '12 at 10:49
show variables like 'char%' output: character_set_client - latin1 character_set_connection - latin1 character_set_database - latin1 character_set_filesystem - binary character_set_result - character_set_server - latin1 character_set_system - utf-8 character_sets_dir - c:\xampp\mysql\share\charsets\ — reus, Jun 29 '12 at 10:56
It's not empty, I hit the enter key accidentally... Most character sets are latin1, except character_set_system which is utf-8, and character_set_filesystem, which is binary. Run from the mysql client the output is different: some sets are cp850. — reus, Jun 29 '12 at 11:09
Is there any reason you need the file names in the zip to be in windows-1252? JDBC+MySQL is always painful when it comes to character sets. You would be much better off using utf8 everywhere. Or do you really need those file names in windows-1252? — hellodanylo, Jun 29 '12 at 11:32
Yes, the zip files come from different sources and thus could be any character set (which is another problem). Anyway, thanks for your ideas. — reus, Jun 29 '12 at 12:02

score 1 · Answer 1 · answered Jun 29 '12 at 12:19

As you mentioned in the comments, the incoming data (the names of the zipped files) can be in different character sets. This is going to be an issue for you, because you are using a MySQL+JDBC link, which imposes several limitations (such as one character set per column in MySQL and only one character set per connection in JDBC).

Therefore, I would recommend switching the character sets on the MySQL side to UTF8 (look for variables like character_set_server and character_set_connection), because that will let you transfer and store almost any character you may receive. See here on how to properly set up your MySQL server. Note that configuring the MySQL server might be challenging, so don't hesitate to PM for additional help. JDBC will automatically adjust to the server's character_set_connection variable, so you don't have to change anything in your Java application.

The one thing you WILL have to change in your application is that all incoming data must be converted to UTF8 before you send it to the MySQL server for storage.
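A minimal sketch of what "converting to UTF8" can look like in Java, assuming you already know (or have guessed) the charset of the incoming bytes. The class and method names here are made up for illustration. Since a Java String is already Unicode, the real work is decoding the bytes with the right source charset; the encoding step is then mechanical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeToUtf8 {
    // Decodes the input with the charset it was written in,
    // then re-encodes the resulting String as UTF-8.
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        String decoded = new String(input, sourceCharset);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample input: "caf" followed by 0x82, which is e-acute in IBM437.
        byte[] raw = { 'c', 'a', 'f', (byte) 0x82 };
        byte[] utf8 = toUtf8(raw, Charset.forName("IBM437"));

        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim()); // 63 61 66 c3 a9
    }
}
```

If the source charset is guessed wrong, the output is still valid UTF-8, just for the wrong characters, so this step cannot detect a bad guess on its own.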

Good luck.

score 0 · Answer 2 · answered Jun 29 '12 at 10:43

In the table where you store the data, make sure you use the correct collation to be able to store the e-acute character.

sed

He said that a direct query succeeds, which means that everything is alright on the server's side. – hellodanylo Jun 29 '12 at 10:46

score 0 · Accepted Answer · edited May 23 '17 at 10:34

The issue is resolved. This post suggested that the encoding of filenames in a zip file might not be windows-1252 but rather IBM437. Changing the Charset from:

ZipFile zf = new ZipFile( zipFile, Charset.forName( "windows-1252" ) );

to

ZipFile zf = new ZipFile( zipFile, Charset.forName( "IBM437" ) );

gave the desired result: when saving the acquired filename in MySQL, it was stored correctly with é.
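The effect can be reproduced without a real zip file. The sketch below is an illustration, not the code from the application: it builds a one-entry archive in memory with the name encoded as IBM437, then reads the name back with both charsets. The class and method names are invented for the example, and it assumes the JDK in use ships the IBM437 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipNameCharsetDemo {
    // Builds a one-entry zip in memory; the entry name is encoded with cs.
    static byte[] zipWithName(String name, Charset cs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bytes, cs)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.closeEntry();
        }
        return bytes.toByteArray();
    }

    // Reads the first entry name, decoding it with cs.
    static String firstName(byte[] zip, Charset cs) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip), cs)) {
            return zis.getNextEntry().getName();
        }
    }

    public static void main(String[] args) throws IOException {
        Charset ibm437 = Charset.forName("IBM437");
        Charset cp1252 = Charset.forName("windows-1252");

        // U+00E9 is written as an escape so the source file encoding does not matter.
        byte[] zip = zipWithName("caf\u00e9.pdf", ibm437);

        // Print the code point of the fourth character instead of the character
        // itself, so the console encoding cannot hide the difference.
        System.out.println(Integer.toHexString(firstName(zip, ibm437).charAt(3))); // e9
        System.out.println(Integer.toHexString(firstName(zip, cp1252).charAt(3))); // 201a
    }
}
```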

What went wrong?

Printing out the filenames contained in the zip file to standard output with

System.out.println( name );

made me wrongly assume that the filenames in the zip file were interpreted correctly: when I used windows-1252 encoding to open the zip file, the filename was printed to standard output nicely, with the diacritic: café.pdf. With other character encodings, different symbols appeared instead of the é.

But when printing the Unicode value of the é-char with the help of this answer, I was able to see that when opening the zip file with windows-1252 encoding, the actual Unicode value was NOT \u00e9 (latin small letter e with acute), but \u201a (single low-9 quotation mark). When I opened the ZipFile with IBM437 charset the correct Unicode value DID appear.
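For reference, a small sketch of this kind of check (not the code from the linked answer; the helper name is hypothetical). It prints each character as a hex escape, so the result does not depend on how the console renders non-ASCII characters. The single byte 0x82 stands in for the byte that encoded é in the zip entry name.

```java
import java.nio.charset.Charset;

class CodePointDump {
    // Renders every char of s as a backslash-u escape with four hex digits,
    // which makes the actual Unicode values visible in any console.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed raw byte from the zip entry name.
        byte[] raw = { (byte) 0x82 };

        // Decoded as windows-1252: U+201A, single low-9 quotation mark.
        System.out.println(escape(new String(raw, Charset.forName("windows-1252"))));

        // Decoded as IBM437: U+00E9, latin small letter e with acute.
        System.out.println(escape(new String(raw, Charset.forName("IBM437"))));
    }
}
```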

Of course when printing a String to standard output with PrintStream, the PrintStream is also associated with a certain character encoding. From the PrintStream Javadoc:

All characters printed by a PrintStream are converted into bytes using the platform's default character encoding.

I am working on Windows XP. When I created a new PrintStream

PrintStream out = new PrintStream( System.out, true, "IBM437" );

everything made sense: opening the zip file with IBM437 character encoding, and using the new PrintStream, é was printed correctly.
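My reading of why the wrong charset looked right at first, sketched under the assumption that the default encoding was windows-1252 while the console displayed bytes as IBM437 (or a similar DOS code page): the wrongly decoded U+201A is written by a windows-1252 PrintStream as byte 0x82, and the console shows 0x82 as é. The two mistakes cancel out on screen. The example captures the bytes in memory instead of relying on a real console; the class and method names are invented.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class PrintStreamEncodingDemo {
    // Returns the bytes a PrintStream with the given encoding emits for text.
    static byte[] printedBytes(String text, String encoding)
            throws UnsupportedEncodingException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, encoding);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Correct character, matching stream: e-acute through IBM437.
        System.out.println(hex(printedBytes("\u00e9", "IBM437")));       // 82

        // Wrong character, mismatched stream: U+201A through windows-1252
        // produces the very same byte.
        System.out.println(hex(printedBytes("\u201a", "windows-1252"))); // 82
    }
}
```

Both lines print the same byte, which is why the console output alone could not reveal that the String held the wrong character.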

There Ain't No Such Thing As Plain Text.
