How can I identify all NON UTF8 characters from a given file?

We need to write it in C# and be able to run it in an SSIS environment. After execution, we need to find and check every invalid occurrence, ideally together with its line number in the input file.

Assumptions: the file is a well-formed CSV (in our case), and lines end with CR LF.

  • We? So you and me? – Rand Random Jan 22 '19 at 13:54
  • Possible duplicate of [How to check for invalid UTF-8 characters?](https://stackoverflow.com/questions/50761133/how-to-check-for-invalid-utf-8-characters) – user1519979 Jan 22 '19 at 13:55
  • Show us what you have tried so-far? There is lots of good info on this Wikipedia page: https://en.wikipedia.org/wiki/UTF-8 . Test for the limits of that and you will know if your codes are illegal? An important test is to check that each high bit set code is either a valid lead byte or one of the correct number of trail bytes; also that low codes are never in the trail bytes. – Gem Taylor Jan 22 '19 at 14:01
  • @user1519979: I think it was different. Maybe I should have mentioned the post you proposed from the sources. – Michele Tamburini Jan 22 '19 at 14:41
  • @GemTaylor: thanks for the wiki link. I don't know why, but I found [Daniel Lemire's blog](https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/) easier to follow. – Michele Tamburini Jan 22 '19 at 14:41
  • @RandRandom: :) (I was supposing our dev team) – Michele Tamburini Jan 22 '19 at 14:42
  • Why would you receive a CSV file encoded with UTF-8 and it not be valid UTF-8? Is the source unreliable? Why would you not stop at the first occurrence of corruption and send it back? – Tom Blodget Jan 22 '19 at 17:44
  • @TomBlodget: no, in fact the source is unreliable. I need to investigate the file so that it can be cleaned by the responsible offices. – Michele Tamburini Jan 23 '19 at 13:18
  • I understand. You might consider, though, that our users might want to know that we have lost some of their data due to our mishandling of it. That's what � is for. But, if you need to invest human effort in fixing up the data, then, yes, hunting down the bytes and making a judgment in context might be worthwhile. – Tom Blodget Jan 23 '19 at 15:47

2 Answers

After a bit of research, we collected some hints:

  1. Stackoverflow: Determine a string's encoding in C#
  2. utf8checker: https://archive.codeplex.com/?p=utf8checker
  3. Daniel Lemire's blog: https://lemire.me/blog/2018/05/09/how-quickly-can-you-check-that-a-string-is-valid-unicode-utf-8/

Here's what we have learned:

  1. we needed to scan the file byte by byte,
  2. which class to start from (utf8checker, hint 2),
  3. the algorithm for checking UTF-8 (already well implemented in hint 2)

So we needed to extend the utf8checker class so that it keeps scanning the entire file instead of stopping at the first invalid occurrence. After the complete scan, the code produces a log file listing all the non-UTF-8 occurrences.

The following code works in our case. It is executed in an SSIS Script Task and reads the file name from the input variable User::InputFile.
It could probably be improved further.

 /*
   Microsoft SQL Server Integration Services Script Task
   Write scripts using Microsoft Visual C# 2008.
   The ScriptMain is the entry point class of the script.
*/

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.IO;
using System.Text;
using System.Linq;
using System.Collections.Generic;

namespace ST_5c3d8ec1340c4ab9bbb71cb975760e42.csproj
{

    [System.AddIn.AddIn("ScriptMain", Version = "1.0", Publisher = "", Description = "")]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {

        public void Main()
        {

            String fileToCheck, logFileName;
            bool OK_UTF8;
            IUtf8Checker fileCheckerUtf8 = new Utf8Checker();
            List<IErrorUtf8Checker> errorsList;
            System.IO.StreamWriter logFile;

            try
            {
                fileToCheck = Dts.Variables["User::InputFile"].Value.ToString();

                logFileName = fileToCheck + "_utf8check.log";

                if (File.Exists(fileToCheck))
                {
                    OK_UTF8 = fileCheckerUtf8.Check(fileToCheck);

                    if (OK_UTF8 == false)
                    {
                        errorsList = fileCheckerUtf8.GetErrorList();

                        // 'using' closes the log file even if a write fails
                        using (logFile = new StreamWriter(logFileName))
                        {
                            int i = 0;
                            foreach (ErrorUtf8Checker e in errorsList)
                            {
                                logFile.WriteLine(++i + ") " + e.ToString());
                            }
                        }
                    }

                }
                //exit always with success. It writes a log file if any warning occurs
                Dts.TaskResult = (int)ScriptResults.Success;


            }
            catch (DecoderFallbackException eUTF)
            {
                Console.Write(eUTF.ToString());
                Dts.TaskResult = (int)ScriptResults.Failure;
            }
            catch (Exception e)
            {
                Console.Write(e.ToString());
                Dts.TaskResult = (int)ScriptResults.Failure;
            }

        }

        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion


        /**
        * PrintOnSSISConsole
        * Used to print a string s into the immediate console of SSIS
        */
        public void PrintOnSSISConsole(String s)
        {
            System.Diagnostics.Debug.WriteLine(s);
        }



        /// <summary>
        /// Interface for checking for utf8.
        /// </summary>
        public interface IUtf8Checker
        {
            /// <summary>
            /// Check if file is utf8 encoded.
            /// </summary>
            /// <param name="fileName"></param>
            /// <returns>true if utf8 encoded, otherwise false.</returns>
            bool Check(string fileName);

            /// <summary>
            /// Check if stream is utf8 encoded.
            /// </summary>
            /// <param name="stream"></param>
            /// <returns>true if utf8 encoded, otherwise false.</returns>
            bool IsUtf8(Stream stream);

            /// <summary>
            /// Return a list of found errors of type of IErrorUtf8Checker
            /// </summary>
            /// <returns>List of errors found through the Check metod</returns>
            List<IErrorUtf8Checker> GetErrorList();


        }

        public interface IErrorUtf8Checker
        {

        }

        /// <summary>
        /// http://anubis.dkuug.dk/JTC1/SC2/WG2/docs/n1335
        /// 
        /// http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
        /// 
        /// http://www.unicode.org/versions/corrigendum1.html
        /// 
        /// http://www.ietf.org/rfc/rfc2279.txt
        /// 
        /// </summary>
        public class Utf8Checker : IUtf8Checker
        {

            // newLineArray = used to understand the new line sequence 
            private static byte[] newLineArray = new byte[2] { 13, 10 };
            private int line = 1;
            private byte[] lineArray = new byte[2] { 0, 0 };

            // used to keep track of the errors found in the file
            private List<IErrorUtf8Checker> errorsList;

            public Utf8Checker()
            {
                this.errorsList = new List<IErrorUtf8Checker>();
            }

            public int getNumberOfErrors()
            {
                return errorsList.Count();
            }

            public bool Check(string fileName)
            {
                using (BufferedStream fstream = new BufferedStream(File.OpenRead(fileName)))
                {
                    return this.IsUtf8(fstream);
                }
            }

            public int getLine()
            {
                return line;
            }

            public List<IErrorUtf8Checker> GetErrorList()
            {
                return errorsList;
            }

            /// <summary>
            /// Check if stream is utf8 encoded.
            /// Notice: stream is read completely in memory!
            /// </summary>
            /// <param name="stream">Stream to read from.</param>
            /// <returns>True if the whole stream is utf8 encoded.</returns>
            public bool IsUtf8(Stream stream)
            {
                // Read the stream once, chunk by chunk. Stream.Read may return
                // fewer bytes than requested, so loop until it returns 0.
                using (MemoryStream memory = new MemoryStream())
                {
                    byte[] chunk = new byte[4 * 1024];
                    int read;
                    while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
                    {
                        memory.Write(chunk, 0, read);
                    }
                    byte[] buffer = memory.ToArray();
                    return IsUtf8(buffer, buffer.Length);
                }
            }

            /// <summary>
            /// 
            /// </summary>
            /// <param name="buffer"></param>
            /// <param name="length"></param>
            /// <returns></returns>
            public bool IsUtf8(byte[] buffer, int length)
            {
                int position = 0;
                int bytes = 0;
                bool ret = true;
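                while (position < length)
                {
                    if (!IsValid(buffer, position, length, ref bytes))
                    {
                        ret = false;
                        errorsList.Add(new ErrorUtf8Checker(getLine(), buffer[position]));
                        // Skip exactly one byte after an error. Otherwise 'bytes' may be 0
                        // (endless loop) or still hold the length of the previous character.
                        bytes = 1;
                    }
                    position += bytes;
                }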
                return ret;
            }

            /// <summary>
            /// 
            /// </summary>
            /// <param name="buffer"></param>
            /// <param name="position"></param>
            /// <param name="length"></param>
            /// <param name="bytes"></param>
            /// <returns></returns>
            public bool IsValid(byte[] buffer, int position, int length, ref int bytes)
            {
                if (length > buffer.Length)
                {
                    throw new ArgumentException("Invalid length");
                }

                if (position > length - 1)
                {
                    bytes = 0;
                    return true;
                }

                byte ch = buffer[position];
                char ctest = (char)ch; // for debug  only
                this.detectNewLine(ch);

                if (ch <= 0x7F)
                {
                    bytes = 1;
                    return true;
                }

                if (ch >= 0xc2 && ch <= 0xdf)
                {
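                    // A 2-byte sequence needs one more byte after the lead byte
                    if (position >= length - 1)
                    {
                        // consume the truncated lead byte so the caller keeps advancing
                        bytes = 1;
                        return false;
                    }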
                    if (buffer[position + 1] < 0x80 || buffer[position + 1] > 0xbf)
                    {
                        //bytes = 0;
                        return false;
                    }
                    bytes = 2;
                    return true;
                }

                if (ch == 0xe0)
                {
                    // A 3-byte sequence needs two more bytes after the lead byte
                    if (position >= length - 2)
                    {
                        //bytes = 0;
                        return false;
                    }

                    if (buffer[position + 1] < 0xa0 || buffer[position + 1] > 0xbf ||
                        buffer[position + 2] < 0x80 || buffer[position + 2] > 0xbf)
                    {
                        //bytes = 0;
                        return false;
                    }
                    bytes = 3;
                    return true;
                }


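                if (ch >= 0xe1 && ch <= 0xef)
                {
                    // A 3-byte sequence needs two more bytes after the lead byte
                    if (position >= length - 2)
                    {
                        //bytes = 0;
                        return false;
                    }

                    // After 0xED the second byte must stay below 0xA0: higher values
                    // would encode UTF-16 surrogates, which are not valid in UTF-8
                    byte maxSecond = (ch == 0xed) ? (byte)0x9f : (byte)0xbf;

                    if (buffer[position + 1] < 0x80 || buffer[position + 1] > maxSecond ||
                        buffer[position + 2] < 0x80 || buffer[position + 2] > 0xbf)
                    {
                        //bytes = 0;
                        return false;
                    }

                    bytes = 3;
                    return true;
                }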

                if (ch == 0xf0)
                {
                    // A 4-byte sequence needs three more bytes after the lead byte
                    if (position >= length - 3)
                    {
                        //bytes = 0;
                        return false;
                    }

                    if (buffer[position + 1] < 0x90 || buffer[position + 1] > 0xbf ||
                        buffer[position + 2] < 0x80 || buffer[position + 2] > 0xbf ||
                        buffer[position + 3] < 0x80 || buffer[position + 3] > 0xbf)
                    {
                        //bytes = 0;
                        return false;
                    }

                    bytes = 4;
                    return true;
                }

                if (ch == 0xf4)
                {
                    // A 4-byte sequence needs three more bytes after the lead byte
                    if (position >= length - 3)
                    {
                        //bytes = 0;
                        return false;
                    }

                    if (buffer[position + 1] < 0x80 || buffer[position + 1] > 0x8f ||
                        buffer[position + 2] < 0x80 || buffer[position + 2] > 0xbf ||
                        buffer[position + 3] < 0x80 || buffer[position + 3] > 0xbf)
                    {
                        //bytes = 0;
                        return false;
                    }

                    bytes = 4;
                    return true;
                }

                if (ch >= 0xf1 && ch <= 0xf3)
                {
                    // A 4-byte sequence needs three more bytes after the lead byte
                    if (position >= length - 3)
                    {
                        //bytes = 0;
                        return false;
                    }

                    if (buffer[position + 1] < 0x80 || buffer[position + 1] > 0xbf ||
                        buffer[position + 2] < 0x80 || buffer[position + 2] > 0xbf ||
                        buffer[position + 3] < 0x80 || buffer[position + 3] > 0xbf)
                    {
                        //bytes = 0;
                        return false;
                    }

                    bytes = 4;
                    return true;
                }

                return false;
            }

            private void detectNewLine(byte ch)
            {
                // a CR (13) was seen on the previous byte: look for the LF (10)
                if (this.lineArray[0] == newLineArray[0])
                {
                    if (ch == newLineArray[1])
                    {
                        // found a complete CR LF pair: new line
                        line++;
                    }
                    // CR and LF must be adjacent, so forget the pending CR,
                    // unless the current byte is itself a CR (as in CR CR LF)
                    this.lineArray[0] = (ch == newLineArray[0]) ? ch : (byte)0;
                }
                else
                {
                    // remember a CR (13) so the next byte can be tested for LF (10)
                    if (ch == newLineArray[0])
                    {
                        this.lineArray[0] = ch;
                    }
                }
            }
        }

        public class ErrorUtf8Checker : IErrorUtf8Checker
        {
            private int line;
            private byte ch;

            public ErrorUtf8Checker(int line, byte character)
            {
                this.line = line;
                this.ch = character;
            }

            public ErrorUtf8Checker(int line)
            {
                this.line = line;
            }

            public override string ToString()
            {
                string s;
                try
                {
                    if (ch > 0)
                    {
                        s = "line: " + line + " code: " + ch + ", char: " + (char)ch;
                    }
                    else
                    {
                        s = "line: " + line;
                    }
                    return s;
                }
                catch (Exception e)
                {
                    Console.Write(e.ToString());
                    return base.ToString();
                }
            }
        }



    }
}
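
The core of the change, recording an invalid byte and resuming at the next one instead of stopping, can also be isolated in a much shorter form. The following is only a sketch of that rule, not the utf8checker code: the class `Utf8Scan` and its method names are hypothetical, and the byte ranges are the ones listed in RFC 3629.

```csharp
using System;
using System.Collections.Generic;

public static class Utf8Scan
{
    // Returns the offset of every byte that does not start a well-formed
    // UTF-8 sequence. The scan never stops early: after an error it
    // resumes at the very next byte.
    public static List<int> FindInvalidOffsets(byte[] data)
    {
        List<int> bad = new List<int>();
        int i = 0;
        while (i < data.Length)
        {
            int len = SequenceLength(data, i);
            if (len == 0)
            {
                bad.Add(i); // record the offending byte
                i += 1;     // and resynchronise on the next one
            }
            else
            {
                i += len;
            }
        }
        return bad;
    }

    // Length of the well-formed sequence starting at i, or 0 if there is none.
    private static int SequenceLength(byte[] d, int i)
    {
        byte b = d[i];
        if (b <= 0x7F) return 1;

        int len;
        byte lo = 0x80, hi = 0xBF; // allowed range of the second byte
        if (b >= 0xC2 && b <= 0xDF)
        {
            len = 2;
        }
        else if (b >= 0xE0 && b <= 0xEF)
        {
            len = 3;
            if (b == 0xE0) lo = 0xA0; // rejects overlong forms
            if (b == 0xED) hi = 0x9F; // rejects UTF-16 surrogates
        }
        else if (b >= 0xF0 && b <= 0xF4)
        {
            len = 4;
            if (b == 0xF0) lo = 0x90; // rejects overlong forms
            if (b == 0xF4) hi = 0x8F; // rejects values above U+10FFFF
        }
        else
        {
            return 0; // 0x80..0xC1 and 0xF5..0xFF never start a sequence
        }

        if (i + len > d.Length) return 0; // sequence truncated at end of input
        if (d[i + 1] < lo || d[i + 1] > hi) return 0;
        for (int k = 2; k < len; k++)
        {
            if (d[i + k] < 0x80 || d[i + k] > 0xBF) return 0;
        }
        return len;
    }
}
```

For example, `Utf8Scan.FindInvalidOffsets(new byte[] { 0x41, 0xB0, 0x42, 0xC3, 0xA8, 0xF2 })` reports offsets 1 and 5: 0xB0 is a stray continuation byte and 0xF2 starts a sequence that is cut off by the end of the input. Mapping offsets to line numbers would still need the CR LF counting done in the code above.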

Given the example:

Hello world test UTF8
err 1: °
text ok line 3
err 2: ò
errs 3: à è § °
end file 

the code posted will create a new file containing:

1) line: 2 code: 176, char: °
2) line: 4 code: 242, char: ò
3) line: 5 code: 224, char: à
4) line: 5 code: 232, char: è
5) line: 5 code: 167, char: §
6) line: 5 code: 176, char: °
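
To reproduce this result, the sample has to be saved in a single-byte encoding; if an editor saves it as UTF-8, no errors are reported. The reported codes (176, 242, 224, 232, 167) match the ISO-8859-1 values of the characters, so the sketch below assumes that encoding (Windows-1252 uses the same bytes for these characters). The class `SampleFile` and its method names are hypothetical helpers, not part of the code above.

```csharp
using System;
using System.IO;
using System.Text;

public static class SampleFile
{
    // Builds the six sample lines, CR LF separated, as ISO-8859-1 bytes.
    // Every accented character becomes a single byte >= 0x80, which is
    // not valid UTF-8 when it stands alone.
    public static byte[] BuildSampleBytes()
    {
        string text =
            "Hello world test UTF8\r\n" +
            "err 1: \u00B0\r\n" +                        // degree sign
            "text ok line 3\r\n" +
            "err 2: \u00F2\r\n" +                        // o with grave
            "errs 3: \u00E0 \u00E8 \u00A7 \u00B0\r\n" +  // a grave, e grave, section sign, degree sign
            "end file";
        return Encoding.GetEncoding("ISO-8859-1").GetBytes(text);
    }

    // Writes the sample to disk so it can be passed to the checker.
    public static void WriteSample(string path)
    {
        File.WriteAllBytes(path, BuildSampleBytes());
    }
}
```

Calling `SampleFile.WriteSample(path)` and pointing `User::InputFile` at that path should give the checker a file with six invalid bytes on lines 2, 4 and 5.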

When you load the file into a byte array and then decode it into a string, invalid UTF-8 bytes are replaced by the Unicode replacement character U+FFFD (�), which many tools display as a question mark. Your code should look something like this:

 byte[] data = File.ReadAllBytes(pathToYourFile);
 string result = Encoding.UTF8.GetString(data);

Next, you can apply cleaning steps to the decoded string, for example.
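
If the original bytes and their line are needed rather than a replacement character, one option is a strict decoder. `new UTF8Encoding(false, true)` throws a `DecoderFallbackException` on invalid input, and its `BytesUnknown` property carries the bytes that could not be decoded. The sketch below is an assumption about how this could be combined with a per-line split; the class `StrictUtf8` and the method `FindBadLines` are hypothetical names. Because decoding stops at the first exception, it reports only the first invalid sequence of each line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StrictUtf8
{
    // Splits the data on LF (0x0A), decodes every line with a throwing
    // UTF-8 decoder and returns one message per line that fails to decode.
    public static List<string> FindBadLines(byte[] data)
    {
        // second argument: throwOnInvalidBytes = true
        Encoding strict = new UTF8Encoding(false, true);
        List<string> report = new List<string>();
        int line = 1;
        int start = 0;
        for (int i = 0; i <= data.Length; i++)
        {
            // a line ends at an LF or at the end of the data
            if (i == data.Length || data[i] == 0x0A)
            {
                try
                {
                    strict.GetString(data, start, i - start);
                }
                catch (DecoderFallbackException ex)
                {
                    string hex = ex.BytesUnknown == null
                        ? "unknown"
                        : BitConverter.ToString(ex.BytesUnknown);
                    report.Add(string.Format(
                        "line: {0}, first invalid byte(s): {1}", line, hex));
                }
                line++;
                start = i + 1;
            }
        }
        return report;
    }
}
```

Usage would be `StrictUtf8.FindBadLines(File.ReadAllBytes(pathToYourFile))`. Splitting on the LF byte is safe because 0x0A never occurs inside a multi-byte UTF-8 sequence.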

Yuri
  • Thanks @Yuri I tried to use the Encoding class but I wasn't able to get it to work... obviously my fault... – Michele Tamburini Jan 22 '19 at 14:45
  • I tried the following code: `byte[] data = File.ReadAllBytes(fileToCheck);` `string result = Encoding.UTF8.GetString(data);` `logFileName2 = fileToCheck + "_utf8check2.log";` `logFile2 = new StreamWriter( File.Open(logFileName2, FileMode.Create),Encoding.UTF8);` `logFile2.Write(result);` `logFile2.Flush();` It produces a file with special characters. But then I need to find out what the original character was, and its line. – Michele Tamburini Jan 23 '19 at 13:16