I have a Maven project; the character encoding is set to UTF-8 in my parent POM.

    <plugin>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>2.3.2</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
        <encoding>UTF-8</encoding>
      </configuration>
    </plugin>

But some of the Java files use characters like ` or ``, and this is causing a compilation error for me.

In Eclipse (Properties > Resource > Text file encoding, and Window > Preferences > Workspace > Text file encoding) I have set the encoding to UTF-8. Please let me know how this issue can be solved.

PERL CODE TO DO CONVERSION STUFF

    use strict;
    use warnings;
    use File::Find;
    use open qw/:std :utf8/;

    my $dir = "D:\\files";


    find({ wanted => \&collectFiles}, "$dir");

    sub collectFiles {
        my $filename = $_;
        if($filename =~ /.java$/){
            #print $filename."\n";
            startConversion($filename);
        }
    }

    sub startConversion{
        my $filename = $_;
        print $filename."\n";
        open(my $INFILE,  '<:encoding(cp1252)',  $filename) or die $!;
        open(my $OUTFILE, '>:encoding(UTF-8)', $filename) or die $!;
    }

user2604052
  • Have you checked that the file causing the exception is indeed UTF-8 encoded? – JB Nizet Aug 15 '13 at 11:53
  • Please note that there are 3000 Java files present in my project, so going manually to each file and saving it in UTF-8 encoding is not the right way. Is there a Perl script to solve this issue? – user2604052 Aug 15 '13 at 11:54
  • When I open the file in Notepad++, I can see "Encode in ANSI" highlighted, so I believe it is not saved in UTF-8. – user2604052 Aug 15 '13 at 11:55
  • Can you clarify what chars have been incorrectly used? Is it `\`` and double `\``? – Alastair McCormack Aug 15 '13 at 11:56
  • If you know Java, why not write it yourself in Java? Shouldn't be too hard. Otherwise, see http://superuser.com/questions/69091/batch-change-encoding-ascii-files-from-utf-8-to-iso-8859-1 – JB Nizet Aug 15 '13 at 11:56
  • The compilation error message is "error: unmappable character for encoding UTF8". – user2604052 Aug 15 '13 at 11:59
  • I started writing a Perl script, as I am not familiar with Java. use strict; use warnings; use File::Find; use open qw/:std :utf8/; my $dir = "D:\files"; find({ wanted => \&collectFiles}, "$dir"); sub collectFiles { my $filename = $_; if($filename =~ /.java$/){ #print $filename."\n"; startConversion($filename); } } sub startConversion{ my $filename = $_; print $filename."\n"; open(my $INFILE, '<:encoding(cp1252)', $filename) or die $!; open(my $OUTFILE, '>:encoding(UTF-8)', $filename) or die $!; } – user2604052 Aug 15 '13 at 12:03
  • But the script is replacing all the content in the Java code and not working as expected. Any idea what the issue is? – user2604052 Aug 15 '13 at 12:05
  • Post that code in the question, and re-tag it with Perl. You're asking why a Perl script doesn't work. This hasn't much to do with Java. – JB Nizet Aug 15 '13 at 12:07
  • \` is a valid ASCII/UTF-8 character, so you may need to look elsewhere for the problem chars. Why don't you just work out what character encoding your files are in and set that as the encoding type in Eclipse? Your pom.xml can stay defined as UTF-8 'cause I doubt it'll have any non-ASCII chars in it. – Alastair McCormack Aug 15 '13 at 12:12
  • The Perl code has been added. The characters present are `` and `; they appear all over the 3000 files, in the Javadocs. – user2604052 Aug 15 '13 at 12:12
  • While compiling, the `` characters are being represented as � – user2604052 Aug 15 '13 at 12:13
  • ` is the same in UTF-8, Windows 1252, ISO-8859 and ASCII. Your conversion won't do anything. – Alastair McCormack Aug 15 '13 at 12:15
  • Do you mean that converting from cp1252 (the default Windows encoding) to UTF-8 won't solve the issue? The reason I think the encoding is cp1252 is that somebody who committed the code had the default Eclipse text encoding set to cp1252, which I believe is the root cause of this issue. – user2604052 Aug 15 '13 at 12:22
  • "it is causing compilation error to me" is not a very good description of a problem. What error, for starters? – ikegami Aug 15 '13 at 13:01
  • Do you have ‘ or `? They look almost identical in some fonts. – ikegami Aug 15 '13 at 13:03
  • Yes, I verified: I have ` and `` in my Javadocs. I understand that the reason I am getting a compilation error even though my encoding is UTF-8 is that the encoding that was set in Eclipse is cp1252. Please correct me if my understanding is wrong. – user2604052 Aug 15 '13 at 14:52
  • One point I should bring to everyone's notice: when I open the file in Notepad++, the encoding highlighted is ANSI. – user2604052 Aug 15 '13 at 14:54
  • This is why I started thinking of changing the encoding of each Java file to UTF-8 using a Perl script. Is there anything wrong with my approach? – user2604052 Aug 15 '13 at 14:56
  • Sorry, I have used \\ itself; I missed it while copying the code here. I also verified that I can see all the Java files present in that directory, and I am able to touch all of them. But after running the script, all the Java code in those files gets overwritten and they become blank files. – user2604052 Aug 15 '13 at 15:09

2 Answers

These two lines do not start or perform re-encoding:

    open(my $INFILE,  '<:encoding(cp1252)',  $filename) or die $!;
    open(my $OUTFILE, '>:encoding(UTF-8)', $filename) or die $!;

Opening a file with > truncates it, which deletes the content. See the open documentation for further details.

Rather, you have to read the data from the first file (which automatically decodes it), and write it back to another file (which automatically encodes it). Because source and target file are identical here, and because of the quirks of file handling under Windows, we should write our output to a temp file:

    use autodie;  # automatic error handling :)

    open my $in,  '<:encoding(cp1252)', $filename;
    open my $out, '>:encoding(UTF-8)', "$filename~";  # or whatever you'd like to call the temp file
    print {$out} $_ while <$in>;  # copy the file, recoding it
    close $_ for $in, $out;

    rename "$filename~" => $filename;  # BEWARE: doesn't work across logical volumes!

If the files are small enough (hint: source code usually is), then you could also load them into memory:

    use File::Slurp;

    my $contents = read_file $filename, { binmode => ':encoding(cp1252)' };
    write_file $filename, { binmode => ':encoding(UTF-8)' }, $contents;

amon
  • So this means that writing the Perl script was indeed the right approach to attack this problem, as ~3000 files are involved. – user2604052 Aug 15 '13 at 16:21
  • @user2604052 I don't know, as I wouldn't think that re-encoding a file would be necessary under default settings, or when no non-ASCII characters are used. My answer only points out why your current Perl script couldn't have worked. – amon Aug 15 '13 at 16:30
  • Yes, you are correct: this is not necessary if we use UTF-8 encoding from the beginning of a project. But in my case the files are in ANSI format, so I believe this should convert them from ANSI to UTF-8 and hence solve the compilation error. – user2604052 Aug 15 '13 at 16:42

If you're on Linux or Mac OS X, you can use iconv to convert files to UTF-8. The Java 1.7 compiler does not accept bytes that are not valid in the declared encoding, but Java 1.6 does (although it produces a warning). I know because I have Java 1.7 on my Mac and can't compile some of our code because of this, while Windows users and our Linux continuous build machine can, because they both still use Java 1.6.

The problem with your Perl script is that you open a file for reading and then open the same file name for writing. When you open the file for writing, you delete its contents.

    #! /usr/bin/env perl
    use strict;
    use warnings;
    use autodie;

    use File::Find;

    use constant {
        SOURCE_DIR => 'src',
    };

    # Collect every *.java file below SOURCE_DIR
    my @file_list;
    find(
        sub {
            return unless -f;
            return unless /\.java$/;
            push @file_list, $File::Find::name;
        },
        SOURCE_DIR,
    );

    for my $file ( @file_list ) {
        # Read the whole file first, decoding it from cp1252
        open my $in_fh, "<:encoding(cp1252)", $file;
        my @file_contents = <$in_fh>;
        close $in_fh;

        # Only now reopen (and truncate) the file, writing it back as UTF-8
        open my $out_fh, ">:encoding(UTF-8)", $file;
        print {$out_fh} @file_contents;
        close $out_fh;
    }

Note that I am reading the entire file into memory, which should be fine for Java source code. Even a gargantuan source file (10,000 lines with an average line length of 120 characters) is only about 1.2 megabytes. Unless you're using a TRS-80, a 1.2 megabyte file shouldn't be a memory issue. If you want to be strict about it, use File::Temp to create a temporary file to write to, and then use File::Copy to move that temporary file over the original. Both are standard Perl modules.
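
A minimal sketch of the File::Temp plus File::Copy variant might look like the following. The helper name `recode_file` is made up for illustration, and the code assumes the input really is cp1252. The temporary file is created next to the original so that the final move stays on one volume.

```perl
use strict;
use warnings;
use autodie;
use File::Basename qw(dirname);
use File::Copy qw(move);
use File::Temp ();

# Hypothetical helper: re-encode one file from cp1252 to UTF-8 via a temp file.
sub recode_file {
    my ($file) = @_;

    open my $in, '<:encoding(cp1252)', $file;

    # Temp file in the same directory; UNLINK => 0 because we move it ourselves.
    my $tmp = File::Temp->new( DIR => dirname($file), UNLINK => 0 );
    binmode $tmp, ':encoding(UTF-8)';

    # Copy line by line: decoded on read, encoded on write.
    print {$tmp} $_ while <$in>;
    close $in;
    close $tmp;

    # Replace the original only after the copy is complete.
    move( $tmp->filename, $file ) or die "Cannot replace $file: $!";
    return;
}

# Re-encode every file named on the command line.
recode_file($_) for @ARGV;
```

The original is only replaced once the recoded copy has been written completely, so a failure halfway through leaves the source file untouched. File::Temp creates its files with restrictive permissions, so you may want to restore the original mode afterwards.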

You could also put the entire conversion inside the find subroutine.
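
As a rough sketch of that idea, the conversion can be done directly in the callback passed to find. The name `convert_tree` is hypothetical, and the code again assumes that every .java file is currently cp1252.

```perl
use strict;
use warnings;
use autodie;
use File::Find;

# Hypothetical helper: convert every *.java file below $dir from cp1252 to UTF-8.
sub convert_tree {
    my ($dir) = @_;
    find(
        sub {
            return unless -f;
            return unless /\.java$/;

            # find() has chdir'ed into the file's directory, so $_ works as a path.
            open my $in, '<:encoding(cp1252)', $_;
            my @lines = <$in>;    # list context: does not clobber $_
            close $in;

            # Reopen for writing only after the content is safely in memory.
            open my $out, '>:encoding(UTF-8)', $_;
            print {$out} @lines;
            close $out;
        },
        $dir,
    );
    return;
}

# Convert each directory named on the command line.
convert_tree($_) for @ARGV;
```

Run it only once per tree, and preferably on a copy or under version control: running it a second time would read the UTF-8 bytes as cp1252 and mangle every non-ASCII character.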

David W.