I have a UTF-8 encoded text file, "abc.txt", containing a set of emoticons from a Wikipedia page:
(^_^) happy

My code reads this file and prints its contents to stdout (run from NetBeans):

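import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public static void main(String[] args) throws FileNotFoundException {
    Scanner sc = new Scanner(new File("abc.txt"));
    // hasNextLine() is the check that pairs with nextLine()
    while (sc.hasNextLine()) {
        System.out.println(sc.nextLine());
    }
}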

In NetBeans the output is this:

enter image description here

While in the console the output is:
enter image description here

What is this character?
And how do I remove this?

boxed__l
  • Delete all content from your file and type it in yourself, just to check what went wrong in your case. – Noman ali abbasi Dec 20 '13 at 06:26
  • @Nomanaliabbasi : I manually typed `happy` in notepad, saved as 'abc.txt' in UTF-8 encoding and tried the program. Gives the same non-printable character in the beginning. (BOM apparently) – boxed__l Dec 20 '13 at 06:56
  • Changing the encoding from UTF-8 to unicode seems to solve the problem. [BOM WIKI](http://en.wikipedia.org/wiki/Byte_order_mark) – boxed__l Dec 20 '13 at 07:02
  • "If you save a file as UTF-8, Notepad will put the BOM (byte order mark) EF BB BF at the beginning of the file." [here](http://stackoverflow.com/questions/6769311/how-windows-notepad-interpret-characters). – boxed__l Dec 20 '13 at 07:06

2 Answers

The console output looks like a UTF-8 encoded Byte Order Mark (BOM, U+FEFF), bytes 0xEF 0xBB 0xBF, misinterpreted according to some legacy 8-bit character encoding.

Either save the file without BOM, or make your program recognize and skip the BOM at the start of data.
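For the second option, here is a minimal sketch, assuming the file is decoded as UTF-8 so that the three BOM bytes arrive as the single character U+FEFF at the start of the first line. The class name `SkipBom` and the helpers `stripBom` and `printLines` are made up for illustration; only the file name "abc.txt" comes from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SkipBom {
    // U+FEFF is what the bytes EF BB BF decode to when read as UTF-8.
    private static final char BOM = '\uFEFF';

    // Hypothetical helper: returns the line without a leading BOM, if there is one.
    static String stripBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    // Hypothetical helper: prints every line, stripping a BOM from the first line only.
    static void printLines(BufferedReader reader) throws IOException {
        String line = reader.readLine();
        if (line != null) {
            System.out.println(stripBom(line));
        }
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    }

    public static void main(String[] args) throws IOException {
        // Decode explicitly as UTF-8 instead of relying on the platform default charset.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get("abc.txt"), StandardCharsets.UTF_8)) {
            printLines(reader);
        }
    }
}
```

Only the first line is checked, because a BOM is only meaningful at the very start of the data. If the file is read with some other charset, the BOM bytes will not decode to U+FEFF and this check will not match.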

Jukka K. Korpela

There is a non-printable character at the beginning of the file, added by a Windows editor. You need to either remove it from the file or skip it in your Java code.
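If you would rather fix the file than the reading code, something along these lines should work. It is only a sketch: it assumes the unwanted character is the UTF-8 BOM (bytes EF BB BF) and that the file is small enough to load into memory. The class name `RemoveBomFromFile` and the helper `withoutBom` are invented for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

public class RemoveBomFromFile {
    // Hypothetical helper: returns the content without a leading EF BB BF, if present.
    static byte[] withoutBom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("abc.txt");
        byte[] original = Files.readAllBytes(file);
        byte[] cleaned = withoutBom(original);
        // Rewrite the file only when a BOM was actually removed.
        if (cleaned.length != original.length) {
            Files.write(file, cleaned); // overwrites abc.txt in place
        }
    }
}
```

Note that this overwrites the file, so try it on a copy first.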

JosefN
  • Thanks for your input. I manually typed `happy` in notepad, saved as 'abc.txt' in UTF-8 encoding and tried the program. Gives the same non-printable character in the beginning. Is it standard for UTF-8 documents to do so? – boxed__l Dec 20 '13 at 06:38
  • Sorry, I have not used Windows for ages :). Windows editors add three special bytes at the beginning of the file to indicate that it is a UTF-8 document; simply remove them. I cannot recommend a Windows tool to do it. Try the editor in NetBeans. – JosefN Dec 20 '13 at 06:40