29

I have a CSV file containing accented characters, and I save it in Notepad with UTF-8 encoding selected. When I read the file in Java, the BOM bytes are read as well.

So I want Notepad to save this file as UTF-8 without a BOM in the first place.

Otherwise, is there a built-in class in Java that strips the BOM present at the beginning of a file when reading its contents?

Peter Mortensen
user1058036

7 Answers

37
  1. Use Notepad++ - it is free and much better than Notepad. It can save text without a BOM via Encoding -> Encode in UTF-8 without BOM:

    Notepad++ v6 and older: Screenshot of the Notepad++ menu bar -> Encoding -> Encode in UTF-8 without BOM menu in Notepad++ v6.7.9.2

    Notepad++ v7+:
    Screenshot of the Notepad++ Menubar -> Encoding -> Encode in UTF-8 without BOM menu in Notepad++ v7+

  2. When I encountered this problem in Java, I didn't find any library to parse these first three bytes (BOM). So my advice:

    • Use PushbackInputStream(in, 3).
    • Read the first three bytes.
    • If they are not the BOM (EF BB BF), push them back.
    • Process the stream as UTF-8.
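The steps above can be sketched roughly as follows. This is only a minimal illustration, not a library API: the class name `BomSkipper` and the method `skipUtf8Bom` are made up for this example, and it assumes the caller will decode whatever remains as UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption for this sketch.
public class BomSkipper {
    private static final int BOM_LENGTH = 3; // UTF-8 BOM is EF BB BF

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, BOM_LENGTH);
        byte[] head = new byte[BOM_LENGTH];

        // read() may return fewer bytes than requested, so loop until we
        // have three bytes or hit end of stream.
        int n = 0;
        while (n < BOM_LENGTH) {
            int r = pushback.read(head, n, BOM_LENGTH - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        boolean isBom = n == BOM_LENGTH
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;

        // Not a BOM: push back exactly the bytes we consumed.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a', ',', 'b'};
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine()); // prints "a,b"
        }
    }
}
```

Files shorter than three bytes are handled too, because only the bytes actually read are pushed back.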
Richard
korifey
  • I'm looking into this now. I will post here if I find a better way than stripping off bytes. The problem with stripping bytes blindly is that I can't say the files are saved only as UTF-8; they may be encoded in ANSI too. – user1058036 Dec 09 '11 at 09:44
  • You don't need to strip blindly. If you analyze the first two bytes and they are a BOM, there is a 99% probability that the file is UTF-8. Only in that case should you cut them off. Anyway, please post your solution here when you find it. – korifey Dec 09 '11 at 11:26
  • Worked for me! As soon as I saved it in Notepad++ the utf-8 errors went away. – PhillipKregg Oct 15 '12 at 19:19
  • 2
    Erm... anyone notice the UTF-8 BOM to be 3 bytes long and not 2 bytes? ;) It's **0xEF 0xBB 0xBF** so you will need to strip the first 3 bytes of the file!!! – e-sushi Jul 07 '13 at 04:41
  • @user1058036 the `file` command can detect UTF-8 without a BOM. Presumably it looks for byte sequences that are valid UTF-8 but not valid ASCII, like d790 http://www.fileformat.info/info/unicode/char/05d0/index.htm. `d7` isn't valid ASCII because ASCII (extended ASCII aside) is 0-127, and 0-7f doesn't include d7. – barlop Oct 21 '16 at 17:06
  • Notepad++ currently labels that option (UTF-8 without BOM) simply as UTF-8, in contrast to UTF-8 with BOM. http://imgur.com/a/wALij – barlop Oct 21 '16 at 17:15
11

I just learned from this Stack Overflow post, as @martin-geisler points out, that you can save files without the BOM in Windows Notepad, by selecting ANSI as the encoding.

I assume this won't work for more advanced uses, because the resulting file is not in the encoding you actually wanted but in ANSI. Still, I tested and confirmed that it works for saving a very small .php script without a BOM using only Notepad.

I learned the long, hard way that Windows' Notepad is not a true editor, although I'd like to point out for others that, despite this, it is misleadingly called up when you type "editor" on newer Windows machines, at least on one of mine.

I am currently using Emacs and other editors to solve this problem.

Peter Mortensen
olaf atchmi
  • Choosing ANSI in Notepad++ worked for me, but encoding it without BOM didn't. – paul May 19 '17 at 08:38
  • 1
    I've found that special characters in text files can change the encoding if edited in Word. For example, we had an .xml file in which a comment that someone had copied and pasted from an email/MS Word caused the UTF-8 file to change to UTF-8-BOM. I removed the special characters and was able to verify that Notepad saved the file as UTF-8 without BOM once they were gone. – mrswadge Jan 04 '19 at 16:07
  • 1
    Note that for any file containing only the base 128 ASCII characters (0x00-0x7F), UTF-8 is exactly identical to "ANSI". – Justin Time - Reinstate Monica Aug 22 '19 at 01:47
9

Notepad on Windows 10 version 1903 (May 2019 update) and later versions supports saving to UTF-8 without a BOM. In fact, UTF-8 is the default file format now.

Screenshot of Notepad

Reference: Windows 10 Notepad is Getting Better UTF-8 Encoding Support

Peter Mortensen
Marc Durdin
9

Use Notepad++ instead. See my personal blog post on it. From within Notepad++, choose the "Encoding" menu, then "Encode in UTF-8 without BOM".

ziesemer
  • I am aware of Notepad2 and Notepad++. I want to do that in Notepad itself. – user1058036 Dec 08 '11 at 18:25
  • Standard Windows notepad is not a true editor, and doesn't support any options around the BOM functionality. If you don't want to use another editor, you will need to follow the advice of one of the other answers here to properly handle the BOM within the Java code. – ziesemer Dec 08 '11 at 18:47
0

The answer is: Not at all. Notepad can't do that.

In Java you can just skip the first three bytes (EF BB BF) of your InputStream when they are a BOM and be done.
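A minimal sketch of doing the skip at the character level instead of the byte level: the UTF-8 BOM is three bytes (EF BB BF), which Java's UTF-8 decoder passes through as the single character U+FEFF, so it is enough to check the first decoded character. The class name `BomAwareReader` and the method `openUtf8` are invented for this example, and it assumes the file really is UTF-8.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption for this sketch.
public class BomAwareReader {
    private static final int BOM_CHAR = 0xFEFF; // EF BB BF decoded as UTF-8

    /** Opens a UTF-8 reader and drops a leading U+FEFF if there is one. */
    public static BufferedReader openUtf8(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                  // remember the start of the text
        if (reader.read() != BOM_CHAR) { // first char is real data (or EOF)
            reader.reset();              // so rewind and keep it
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'd'};
        try (BufferedReader reader = openUtf8(new ByteArrayInputStream(data))) {
            System.out.println(reader.readLine()); // prints "id"
        }
    }
}
```

Nothing is skipped unless the first character is actually U+FEFF, so files saved without a BOM are read unchanged.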

Angelo Fuchs
  • 1
    Notepad adds some invisible bytes at the beginning of file to identify the byte order in which the current file is encoded. – user1058036 Dec 08 '11 at 18:27
  • Then just skip the appropriate bytes. If Notepad adds them and you want to stick to Notepad, then skip them and everything is fine. – Angelo Fuchs Dec 08 '11 at 18:53
  • I will check for any solution other than stripping off bytes. If nothing else is feasible, then I must strip off bytes. I can't say the files are saved only as UTF-8; they may be encoded in ANSI too. – user1058036 Dec 09 '11 at 09:47
  • @user1058036 Then you want the BOM to be there so you can distinguish between UTF-8 and ANSI. – Angelo Fuchs Dec 09 '11 at 12:47
  • 2
    @user1058036 It's not so much that Notepad adds the BOM to Unicode files, as it is that Windows in general frequently tends to use the various Unicode BOMs as a general-purpose Unicode signature, effectively turning them into magic numbers that serve as its preferred way to detect Unicode encodings when applicable. This is _probably_ because checking for 2-4 specific bytes is more efficient than using heuristics to detect Unicode, but annoying because it breaks anything that doesn't understand the BOM; the option should be provided to save without the BOM. – Justin Time - Reinstate Monica Mar 31 '17 at 20:05
  • It is strange, though, that Notepad can run heuristics to detect Unicode (including UTF-8) even without a BOM, but doesn't provide the option to create Unicode files without BOMs. – Justin Time - Reinstate Monica Mar 31 '17 at 20:53
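The idea in these comments, using the BOM as a signature to tell UTF-8 files from ANSI ones, might look something like the sketch below. The class name `BomCharsetChooser`, the method `open`, and the `fallback` parameter are all invented for this example. ISO-8859-1 stands in for "ANSI" in the usage code only because it is guaranteed to exist in every JDK; the actual ANSI code page depends on the Windows locale.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name is an assumption for this sketch.
public class BomCharsetChooser {

    /**
     * Decodes as UTF-8 (minus the BOM) when the stream starts with EF BB BF,
     * otherwise decodes the untouched stream with the given fallback charset.
     */
    public static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) { // read() may deliver fewer bytes than requested
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (hasBom) {
            return new InputStreamReader(pushback, StandardCharsets.UTF_8);
        }
        if (n > 0) {
            pushback.unread(head, 0, n); // no BOM: give the bytes back
        }
        return new InputStreamReader(pushback, fallback);
    }

    public static void main(String[] args) throws IOException {
        // 0xE9 alone is "é" in ISO-8859-1 but not valid UTF-8.
        byte[] ansi = {(byte) 0xE9};
        try (Reader reader = open(new ByteArrayInputStream(ansi), StandardCharsets.ISO_8859_1)) {
            System.out.println(reader.read() == 0xE9); // prints "true"
        }
    }
}
```

A file saved as UTF-8 without a BOM would be decoded with the fallback here, so this only helps when every UTF-8 file is known to carry the signature.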
0

You might want to try out Notepad2 or Notepad++. Those Notepad replacements let you choose whether to output a BOM.

As for a Java solution, as far as I know, Java's UTF-8 decoder does not recognize or strip the BOM. I googled and found "Java's UTF-8 and Unicode writing is broken - Use this fix", which might be the solution.

Peter Mortensen
Jeow Li Huan
0

We're using the utility BOMStripperInputStream.java to strip the BOM from our input if present.

Peter Mortensen
Thomas