
I am running a PHP web application which accepts a file from the user, appends some data to it, and gives the user a new file to download.

Occasionally I get files which contain invisible control characters like a BOM, zero-width no-break space, etc. (a plain text editor does not show them, but when checked with the 'less' command or in the 'vi' editor they show up as <U+200F>, <U+FEFF>, <U+0083>, etc.), and these cause issues with our processing. Currently I keep a list of a few such code points which I remove from the file using 'sed' before processing it (below is the command I use). Then I also use 'iconv' to convert non-UTF files to UTF-8.

exec("sed -i 's/\xE2\x80\x8F|\xC2\x81|\xE2\x80\x8B|\xE2\x80\x8E|\xEF\xBB\xBF|\xC2\xAD|\xC2\x89|\xC2\x83|\xC2\x87|\xC2\x82//g' 'my_file_path'");

But the list of such characters keeps growing, and when they are not handled properly they cause the file's encoding to be detected as 'unknown-8bit', which is not correct and shows up as corrupted content. Now I am looking for a solution which is efficient and does not require me to look up the code table.

How should I do this so that it automatically handles every such code point in the file and I don't need to maintain a list of codes to replace? I am open to a Perl/Python/bash script solution as well.

P.S. I need to support all languages (not just US-ASCII or extended ASCII), and I also don't want any data loss.
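
For illustration, this is the shape of solution I am hoping for: a single pass that strips whole Unicode categories instead of individual byte sequences. An untested sketch, assuming the file is already valid UTF-8 (\p{Cf} covers format characters such as U+200B, U+200E, U+200F, U+FEFF and U+00AD; U+0080-U+009F covers C1 controls such as U+0083):

    $text = file_get_contents('my_file_path');
    // Remove every Unicode "format" character (category Cf) and every C1 control code.
    $clean = preg_replace('/[\p{Cf}\x{0080}-\x{009F}]/u', '', $text);
    if ($clean !== null) {
        file_put_contents('my_file_path', $clean);
    }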

FMoridhara
  • I suspect that you want to convert your input text into a standard encoding like Windows-1252 (255 characters) or even plain ASCII (only 127 characters). See similar questions such as https://stackoverflow.com/questions/14933929/php-converting-utf-8-string-to-ansi and https://stackoverflow.com/questions/910793/detect-encoding-and-make-everything-utf-8 – andreaplanet Feb 16 '19 at 10:05
  • I suspect you're using the term "Unicode code point" incorrectly. As andreaplanet suggests, you probably just want to convert data to 7-bit US-ASCII. However, to do that reliably you need two things: 1) know the input encoding (you cannot *guess*), and 2) decide what you want to happen to characters outside the US-ASCII range (e.g., strip them or convert them to `?`). – Álvaro González Feb 16 '19 at 10:28
  • @andreaplanet no, I think I might not have explained my question properly, but I don't want to convert strings to Windows-1252 or plain ASCII; Windows-1252 does not support all languages, e.g. Chinese or Arabic. I want to allow all of them, so the files I create need to be UTF-8. – FMoridhara Feb 18 '19 at 06:33
  • @ÁlvaroGonzález, just like I said above, I don't want to be limited to US-ASCII; I need to allow all languages without any data loss. Also, I am guessing the encoding and converting it to UTF-8, but the problem is I am not able to handle it when some characters are represented as such code points in the file. – FMoridhara Feb 18 '19 at 06:36
  • Your "in plain text editor it does not show but I can see it with 'less' command or in 'vi' editor" addendum clarifies this a little bit. Those are not code points. Those are a plain-text hexadecimal representation internal to and invented by your current editor/viewer to display the raw bytes of characters that are either non-printable or not supported by the tool display routines. A Unicode code point is the catalogue number of a given character and often it doesn't even match its UTF-8 encoding. – Álvaro González Feb 18 '19 at 07:22
  • The solution actually depends on the problem you want to solve. You have, for instance, `\xe2\x80\x8f` which is [U+200F RIGHT-TO-LEFT MARK](https://www.fileformat.info/info/unicode/char/200f/index.htm). If you want to support all languages including Arabic or Hebrew you cannot just strip it or replace it with a letter. I think you'd first need to figure out why this character is not acceptable in the first place. – Álvaro González Feb 18 '19 at 07:24
  • @ÁlvaroGonzález thanks very much, this made me re-think my problem's footprint. Actually I might not need to handle printable characters; that will be the job of the editor application. I just need to strip off the invisible control characters which are causing issues with another program I use to process uploaded files and are turning the files into unknown-8bit. – FMoridhara Feb 18 '19 at 13:35
  • You have to deal with the concept of encoding. The bytes \xE2\x80\x8F have no meaning unless you know what encoding was used for your original text. 1) Find out in what encoding you are receiving the data, probably UTF-8. That's the reason for the existence of the BOM marker: it tells you which Unicode encoding is used; UTF-8, UTF-16 LE, UTF-16 BE and UTF-32 are the most common (see the BOM-check sketch after this thread). – andreaplanet Feb 21 '19 at 09:58
  • 2) Once you know the encoding you can separate the bytes and find the individual "characters", better said the individual "code points". A "character"/"code point" is a number between 1 and roughly 1.1 million; over 100,000 different characters are effectively in use. With UTF-8 a single character can take a variable number of bytes, where the well-known ASCII characters like abc123 have the same values as in the ASCII encoding. 3) You could convert the UTF-8 encoding into a UTF-32 encoding; in that case each character / code point takes exactly 4 bytes, a 32-bit integer, which is easier to process (see the code-point sketch after this thread). – andreaplanet Feb 21 '19 at 09:59
  • 4) You remove the unwanted code points from your text or convert them into different code points; which ones depends on your needs and your text editor. 5) At last you save your text in a specific Unicode encoding format supported by your text editor. If your text editor doesn't support a Unicode encoding format, then you may be able to convert your text into an encoding based on a code page, for example the Windows-1252 code page, containing most Western characters like èüéàä. – andreaplanet Feb 21 '19 at 10:02
  • Of course when you convert your text into the Windows-1252 code page then, instead of the over 100,000 characters that can be used with Unicode, you will only have about 255 possible characters. Again, Unicode isn't easy to fully understand. For example, for some single printable characters you need more than one code point, and for other characters you can have two different code points representing graphically the same character. – andreaplanet Feb 21 '19 at 10:11
  • A code point is a number that represents a single printable character or part of a printable character. Your text cannot contain code points as such; it must have an encoding (UTF-8, UTF-32) and from that encoding you get the list of code points. For ASCII characters it is easy: each ASCII character's number is identical to its UTF-8 byte, which is identical to its code point value. But for other characters it's different: you have the code point value, which is absolute, and the bytes needed to represent that code point in a specific encoding. – andreaplanet Feb 21 '19 at 10:16
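
Following up on step 1) above, a minimal BOM check in PHP might look like this (a sketch; it only covers files that actually carry a BOM, so files without one still need heuristic detection such as mb_detect_encoding()):

    $head = (string) file_get_contents('my_file_path', false, null, 0, 4);
    if     (strncmp($head, "\xEF\xBB\xBF", 3) === 0)     { $enc = 'UTF-8'; }
    elseif (strncmp($head, "\xFF\xFE\x00\x00", 4) === 0) { $enc = 'UTF-32LE'; } // test before UTF-16LE
    elseif (strncmp($head, "\x00\x00\xFE\xFF", 4) === 0) { $enc = 'UTF-32BE'; }
    elseif (strncmp($head, "\xFF\xFE", 2) === 0)         { $enc = 'UTF-16LE'; }
    elseif (strncmp($head, "\xFE\xFF", 2) === 0)         { $enc = 'UTF-16BE'; }
    else   { $enc = null; } // no BOM: fall back to a heuristic guess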
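
And a rough sketch of steps 2)-4), assuming the text is already valid UTF-8 (mb_ord() requires PHP >= 7.2 with the mbstring extension; the two code points dropped here, U+200F and U+FEFF, are examples only):

    $out = '';
    // Split the UTF-8 string into individual characters and inspect each code point.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $cp = mb_ord($char, 'UTF-8');         // e.g. 0x200F for the bytes \xE2\x80\x8F
        if ($cp === 0x200F || $cp === 0xFEFF) {
            continue;                         // drop unwanted code points
        }
        $out .= $char;                        // keep everything else unchanged
    }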

0 Answers