I can convert text to a Base64-encoded byte array without any problem. Unfortunately, the resulting Base64 string needs to start with "PD", which means I have to encode the text as UTF-8 without a BOM rather than with one. I have tried several code snippets found on the net, but none of them worked. Any help is appreciated.

Thank you so much.

Regards Alper

public static byte[] convertToByteArray(String strToBeConverted) {
    return strToBeConverted.getBytes(StandardCharsets.UTF_8);
}
Tonyukuk
  • http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java maybe –  Aug 01 '16 at 11:02
  • The UTF-8 BOM is three bytes (EF BB BF), always, at the beginning of the data. So you could just chop those off / skip over them when using the converted data. – T.J. Crowder Aug 01 '16 at 11:02
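
The byte-level approach from the comment above could look roughly like the sketch below. It assumes the data is already a UTF-8 byte array and that any BOM is the three-byte sequence EF BB BF; the class name `BomBytes` and the method name `stripUtf8Bom` are hypothetical, chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomBytes {
    // Returns the data without a leading UTF-8 BOM (EF BB BF).
    // If no BOM is present, the original array is returned unchanged.
    public static byte[] stripUtf8Bom(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    public static void main(String[] args) {
        // U+FEFF encodes to three bytes in UTF-8, "<?xml" to five.
        byte[] withBom = "\uFEFF<?xml".getBytes(StandardCharsets.UTF_8);
        System.out.println(withBom.length);               // 8
        System.out.println(stripUtf8Bom(withBom).length); // 5
    }
}
```

The mask `& 0xFF` is needed because Java bytes are signed, so `data[0] == 0xEF` would never be true.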

1 Answer

return strToBeConverted.replaceFirst("^\uFEFF", "").getBytes(StandardCharsets.UTF_8);

The BOM is Unicode code point U+FEFF.

Removing it means checking first whether it is actually present, which String.replaceFirst does in the same call. String.replaceFirst is relatively costly, as it uses regular expression matching, but that is fine here.
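
To see why the BOM matters for the Base64 output, here is a minimal sketch that encodes the same text with and without the leading U+FEFF. It assumes the input is XML starting with `<?xml` (which is what a Base64 prefix of "PD" suggests) and uses `java.util.Base64` from Java 8; the class name `BomBase64Demo` and the method name `toBase64WithoutBom` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BomBase64Demo {
    // Strips a leading BOM (if any), encodes as UTF-8, then as Base64.
    public static String toBase64WithoutBom(String text) {
        byte[] bytes = text.replaceFirst("^\uFEFF", "").getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        String xml = "\uFEFF<?xml"; // sample input with a leading BOM

        // With the BOM, the bytes EF BB BF come first and encode to "77u/".
        String withBom = Base64.getEncoder()
                .encodeToString(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(withBom);                 // 77u/PD94bWw=

        // Without the BOM, the output starts with "PD" as required.
        System.out.println(toBase64WithoutBom(xml)); // PD94bWw=
    }
}
```

A Base64 string that begins with "77u/" is therefore a quick sign that the input still carries a BOM.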

Joop Eggen
  • I fixed it. Thank you, Joop. The original file was wrong; I fixed it and ran your code, and now I have a UTF-8 file without a BOM. Cheers – Tonyukuk Aug 01 '16 at 12:48
  • As you said, `replaceFirst()` is costly, and unnecessary. It would be simpler to just check if the first codepoint in the string is a BOM and if so then skip it, eg: `if ((strToBeConverted.length() > 0) && (strToBeConverted.codePointAt(0) == 0xFEFF)) strToBeConverted = strToBeConverted.substring(1); return strToBeConverted.getBytes(StandardCharsets.UTF_8);` – Remy Lebeau Aug 02 '16 at 23:30
  • @RemyLebeau thanks for the code; `charAt` would be possible too, but nowadays code points are the more logical choice. Note (for readers): since Java 7u6, `substring` copies the characters instead of sharing the char array, but that single copy is still much cheaper than regular expression matching. – Joop Eggen Aug 03 '16 at 06:10
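
Written out as a complete method, the code point check from the comments might look like the following sketch. The method name `convertToByteArray` comes from the question; the wrapper class `BomCodePoint` is hypothetical and only there to make the example compile on its own.

```java
import java.nio.charset.StandardCharsets;

public class BomCodePoint {
    // Encodes the string as UTF-8, skipping a leading BOM (U+FEFF) if present.
    public static byte[] convertToByteArray(String strToBeConverted) {
        if (!strToBeConverted.isEmpty() && strToBeConverted.codePointAt(0) == 0xFEFF) {
            // U+FEFF is in the Basic Multilingual Plane, so it occupies one char.
            strToBeConverted = strToBeConverted.substring(1);
        }
        return strToBeConverted.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(convertToByteArray("\uFEFF<?xml").length); // 5
        System.out.println(convertToByteArray("<?xml").length);       // 5
    }
}
```

Only a BOM at the very start is removed; a U+FEFF later in the text is left untouched, which matches the behavior of the anchored regex `^\uFEFF`.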