This is sort of a variation on previously asked questions, but I am still unable to find an answer, so I'm trying to distill the problem to its core in the hope that there is a solution.
I have a database in which, for historical reasons, certain text entries are not UTF-8. Most are, and all entries made in the last three years are, but some older entries are not.
It is important to find the non-UTF-8 characters so I can either avoid them or convert them to UTF-8 for some XML I'm trying to generate.
The server-side JavaScript I'm using has a ByteBuffer type, so I can treat any piece of text as individual bytes and examine them as needed, without going through the String type, which I understand is problematic in this situation.
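To make that concrete, here is roughly how I can walk the bytes of an entry. (This is only a sketch; the accessor names `length` and `byteAt` are stand-ins for whatever our ByteBuffer actually exposes.)

    // Sketch: scan an entry byte by byte and flag anything non-ASCII,
    // since bytes above 0x7F are where UTF-8 and ISO-8859-1 differ.
    function dumpHighBytes(buffer) {
        for (var i = 0; i < buffer.length; i++) {
            var b = buffer.byteAt(i);   // hypothetical accessor: one raw byte, 0-255
            if (b > 0x7F) {
                console.log("non-ASCII byte 0x" + b.toString(16) + " at offset " + i);
            }
        }
    }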
Is there any check I can run on those bytes to determine whether the text is valid UTF-8?
I've been searching for a couple of months now (;_;) and still have not found an answer. Yet there must be a way of doing it, because XML validators (like the ones in the major browsers) are able to report "encoding errors" when they encounter byte sequences that are not valid UTF-8.
I would just like to know the algorithm for how that is done, so I can try the same sort of test in JavaScript. Once I know which characters are bad, I can convert them from ISO-8859-1 (for example) to UTF-8; I have methods for that.
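For what it's worth, the conversion itself is the easy part, since every ISO-8859-1 byte corresponds to the Unicode code point of the same value. My method amounts to something like this (a sketch, not my exact code):

    // Map one ISO-8859-1 byte to its UTF-8 byte sequence.
    function latin1ByteToUtf8Bytes(b) {
        if (b < 0x80) {
            return [b];                   // ASCII is unchanged in UTF-8
        }
        return [0xC0 | (b >> 6),          // leading byte: 110xxxxx
                0x80 | (b & 0x3F)];       // continuation byte: 10xxxxxx
    }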
I just don't know how to figure out which characters are not valid UTF-8. Again, I understand that the JavaScript String type is problematic in this situation, but, as noted above, I do have a ByteBuffer type that can handle the text on a per-byte basis.
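In other words, the shape of what I'm hoping to write is this, where `isValidUtf8` is precisely the check I'm missing (the other names are just placeholders for code I already have):

    if (isValidUtf8(entryBytes)) {
        emitXml(entryBytes);                          // already valid, use as-is
    } else {
        emitXml(convertLatin1ToUtf8(entryBytes));     // re-encode first
    }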
Thanks for any specific tests people can suggest.
doug