0

I have a simple .txt file, and I'd like to know whether there is a way in Java to do what, for example, Notepad++ does with file encoding. It can detect the encoding of a file (UTF-8, ASCII, UTF-16, ...) and, if we want, convert it to another encoding without turning special characters like 'ç' or '€' into strange characters.

Thanks.

JCasper
  • You need to check for the Byte Order Mark (BOM): https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx – WalterM Oct 30 '15 at 09:34
  • Thank you for your comment. Yes, encodings that use a BOM are easily detected. But, for example, there is UTF-8 with a BOM and UTF-8 without one, and if the file doesn't have a BOM, the problem remains the same. – JCasper Oct 30 '15 at 09:39
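
For readers following the BOM suggestion above, here is a minimal sketch of what such a check could look like using only the Java standard library. The class name `BomSniffer` and its method are hypothetical names chosen for illustration, not part of any library. It only covers the UTF-8 and UTF-16 marks, and it returns an empty result for files without a BOM, which is exactly the case the second comment is worried about.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper (illustration only): maps a leading BOM to a charset.
final class BomSniffer {

    /** Returns the charset implied by a leading BOM, or empty if no known BOM is present. */
    static Optional<Charset> detectBom(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            // Assumption: UTF-32LE (FF FE 00 00) is deliberately not distinguished here.
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty(); // no BOM: the encoding has to be guessed some other way
    }

    // Compares the first bytes of data against the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detectBom(withBom));
        System.out.println(detectBom("hi".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

In practice only the first few bytes of the file need to be read before calling `detectBom`.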

2 Answers

1

Apache Tika has an EncodingDetector with implementations for different contexts. Typically these implementations use heuristics to determine the charset with some probability. If you are interested in the details you can dive into the source.
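
To give an idea of what such a heuristic can look like, here is a small sketch that does not use Tika at all. It relies only on the JDK's `CharsetDecoder` and checks whether the bytes form well-formed UTF-8. This is an assumption about one useful signal, not a description of how Tika or Notepad++ work internally, and the class name `Utf8Check` is hypothetical. A positive result is only a strong hint: pure ASCII text also passes, and so would any other encoding whose bytes happen to be valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (illustration only): strict UTF-8 well-formedness check.
final class Utf8Check {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of silently inserting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00e7\u20ac".getBytes(StandardCharsets.UTF_8);    // 'ç' and '€' as UTF-8
        byte[] latin1 = "\u00e7".getBytes(StandardCharsets.ISO_8859_1);   // 'ç' as a single byte 0xE7
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

A check like this could be tried before falling back to a statistical detector, since a single-byte encoding with accented characters rarely happens to be valid UTF-8.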

wero
  • Thank you for your answer! I had already written some simple code using Apache Tika, but the probability of it guessing right is around 60%. I really wanted to know how Notepad++ does it :\ – JCasper Oct 30 '15 at 09:49
  • @JCasper But this is not your original question. It is a little disappointing to make the effort of writing an answer just to learn that you meant something different. In the end you will need to dive into the Notepad++ sources if you want to know how it is implemented. – wero Oct 30 '15 at 09:59
  • Of course I don't mean the exact code of Notepad++; I will probably never get to it. It was just a remark about the success rate of my class, and that's why I mentioned Notepad++ in the original question, because it has a high success rate at guessing. The question is the one I posted in the first place, and thank you for your attention. – JCasper Oct 30 '15 at 10:10
0

You can do that in Java. There is already another discussion about this topic in another thread: Best way to convert text files between character sets?
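
Once the source encoding is known, the conversion itself needs nothing beyond the standard library. The sketch below is one possible way to do it, not the approach from the linked thread. The names `Transcoder`, `transcode` and `transcodeFile` are hypothetical, and it assumes the file is small enough to load into memory.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (illustration only): re-encodes text from one charset to another.
final class Transcoder {

    /** Decodes the bytes with the source charset and encodes them again with the target charset. */
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        // Note: new String(...) and getBytes(...) silently substitute a replacement
        // for bytes or characters they cannot map, so a wrong 'from' charset
        // corrupts the text without raising an error.
        return new String(input, from).getBytes(to);
    }

    /** Reads a whole file, transcodes it, and writes the result to another file. */
    static void transcodeFile(Path in, Charset from, Path out, Charset to) throws IOException {
        Files.write(out, transcode(Files.readAllBytes(in), from, to));
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE7};  // 'ç' in ISO-8859-1
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(utf8.length + " bytes after conversion");
    }
}
```

Special characters survive only if the target charset can represent them: '€', for example, has no code point in ISO-8859-1, so converting it to that encoding would lose it.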

Community
RatheeshTS