Reading a multi-language text file in C#

Question

I have to read a text file that can contain characters from the following languages: English, Japanese, Chinese, French, Spanish, German, and Italian.

My task is simply to read the data and write it to a new text file, inserting a newline character (\n) after every 100 characters.

I cannot use File.ReadAllText or File.ReadAllLines because the file can be larger than 500 MB, so I have written the following code:

using (var streamReader = new StreamReader(inputFilePath, Encoding.ASCII))
{
      using (var streamWriter = new StreamWriter(outputFilePath,false))
      {
           char[] bytes = new char[100];
           while (streamReader.Read(bytes, 0, 100) > 0)
           {
                 var data = new string(bytes);
                 streamWriter.WriteLine(data);
           }
           MessageBox.Show("Compleated");
       }
}

Besides ASCII, I have tried UTF-7, UTF-8, UTF-32 and IBM500, but had no luck reading and writing the multi-language characters.

Please help me to achieve this.

The language doesn't matter (if you really need to count characters, aka symbols). What matters is the encoding, i.e. how those special chars are stored. If the encoding uses one byte per character (as ASCII does), then your approach is OK, because reading 100 bytes equals reading 100 characters: just add `'\n'` after writing each portion. Otherwise ([variable-length encoding](https://en.wikipedia.org/wiki/Variable-width_encoding)) you are in trouble indeed: you must know or [detect the encoding](http://stackoverflow.com/q/4520184/1997232). — Sinatr, Aug 05 '16 at 12:20
You're going to have to figure out how the text is encoded. If the file includes Asian and western languages, it's probably Unicode, Big Endian Unicode, or UTF32. Hopefully the file begins with an encoding indicator as shown in Christian Jäger's answer. Or it could be a mix of encodings, in which case you'll have to figure out how the file is structured. It's even possible you'll have to examine the text and deduce the encoding, which won't be trivial at all. — Carey Gregory, Aug 05 '16 at 15:51
One option is to "send it back" if it doesn't come with an encoding per specification, convention or standard. "Detecting" encodings is a measure of last resort. — Tom Blodget, Aug 05 '16 at 16:59
I think this http://stackoverflow.com/questions/7470997/replace-german-characters-umlauts-accents-with-english-equivalents should help — Madhav Shenoy, May 10 '17 at 14:03

score 0 · Answer 1 · answered Aug 05 '16 at 12:07

You will have to look at the first 4 bytes of the file you are parsing. These bytes (the byte order mark, if the file has one) will give you a hint about which encoding to use.

Here is a helper method I have written to do the task:

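// Requires: using System.Text;
// As an extension method, this must be declared inside a static class.
public static string GetStringFromEncodedBytes(this byte[] bytes) {
    // Fall back to the system default encoding when no byte order mark (BOM) is found.
    var encoding = Encoding.Default;
    var skipBytes = 0;

    // Check the 4-byte UTF-32 marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
    if (bytes.Length >= 4 && bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 0xfe && bytes[3] == 0xff) {
        encoding = new UTF32Encoding(true, true); // 00 00 FE FF is UTF-32 big endian
        skipBytes = 4;
    }
    else if (bytes.Length >= 4 && bytes[0] == 0xff && bytes[1] == 0xfe && bytes[2] == 0 && bytes[3] == 0) {
        encoding = Encoding.UTF32; // Encoding.UTF32 is little endian
        skipBytes = 4;
    }
    else if (bytes.Length >= 3 && bytes[0] == 0xef && bytes[1] == 0xbb && bytes[2] == 0xbf) {
        encoding = Encoding.UTF8;
        skipBytes = 3;
    }
    else if (bytes.Length >= 3 && bytes[0] == 0x2b && bytes[1] == 0x2f && bytes[2] == 0x76) {
        encoding = Encoding.UTF7; // marked obsolete in .NET 5 and later
        skipBytes = 3;
    }
    else if (bytes.Length >= 2 && bytes[0] == 0xff && bytes[1] == 0xfe) {
        encoding = Encoding.Unicode; // UTF-16 little endian
        skipBytes = 2;
    }
    else if (bytes.Length >= 2 && bytes[0] == 0xfe && bytes[1] == 0xff) {
        encoding = Encoding.BigEndianUnicode; // UTF-16 big endian
        skipBytes = 2;
    }

    // Decode everything after the BOM.
    return encoding.GetString(bytes, skipBytes, bytes.Length - skipBytes);
}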

How will this handle characters from English, Japanese, Chinese, French, Spanish, German and Italian? — slash shogdhe, Aug 05 '16 at 12:09
It will not handle specific chars, it will tell you the encoding of the whole file. If you are having a bytestream of mixed encodings, you will need to check if there is an encoding start in the stream and treat the rest of the stream (until the next encoding starts) with the detected encoding. If the file you are reading is a complete mix of languages without any declaration on the used encoding, I am sorry, I will be of no help — Christian Jäger, Aug 05 '16 at 12:15
Although this will probably be necessary to detect the file's encoding, it's not a full answer since it doesn't show how to read the remainder of the file. That's probably why someone downvoted it. Pretty easy to add an example of how you'd read the whole file with this approach. — Carey Gregory, Aug 05 '16 at 15:42
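
As the comment above suggests, the BOM check can be combined with a streaming read of the whole file. The following is only a sketch, not part of the original answer: the class and method names (BomAwareCopy, Rewrap) are made up for illustration, and UTF-8 is assumed both as the fallback when the file has no BOM and as the output encoding. It relies on StreamReader's built-in BOM detection, which recognizes UTF-8, UTF-16 and UTF-32 marks, instead of checking the bytes by hand.

```csharp
using System.IO;
using System.Text;

public static class BomAwareCopy
{
    // Hypothetical helper: copies inputPath to outputPath and inserts '\n'
    // after every lineLength chars, without loading the whole file into memory.
    public static void Rewrap(string inputPath, string outputPath, int lineLength = 100)
    {
        // Third argument 'true' makes the reader honour a BOM if one is present.
        // Encoding.UTF8 is only the fallback for files without a BOM (an assumption).
        using (var reader = new StreamReader(inputPath, Encoding.UTF8, true))
        // Output is written as UTF-8 with a BOM (also an assumption).
        using (var writer = new StreamWriter(outputPath, false, new UTF8Encoding(true)))
        {
            var buffer = new char[lineLength];
            int read;
            // ReadBlock keeps reading until the buffer is full or the file ends,
            // so only the last chunk can be shorter than lineLength.
            while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
            {
                // Write only the chars that were actually read.
                writer.Write(buffer, 0, read);
                writer.Write('\n');
            }
        }
    }
}
```

A call such as BomAwareCopy.Rewrap(inputFilePath, outputFilePath) would replace the loop in the question. If the file has no BOM and is not UTF-8, the correct encoding still has to be known and passed in place of the fallback.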

SlightlyKosumi · Answer 2 · 2016-08-06T16:55:14.727

This is a good enough start to get to the answer. If i is not equal to 100, you need to read more chars. French chars like é are no trouble: StreamReader decodes them (as UTF-8 by default) into C# char values.

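char[] soFlow = new char[100];
int posn = 0; // number of chars currently held in the buffer
using (StreamReader sr = new StreamReader("a.txt"))
using (StreamWriter sw = new StreamWriter("b.txt", false))
{
    int i;
    // Read returns how many chars it stored, which may be fewer than requested; 0 means end of file.
    while ((i = sr.Read(soFlow, posn, soFlow.Length - posn)) > 0)
    {
        posn += i;
        // if posn < 100, loop again to read into the rest of the buffer
        if (posn == soFlow.Length)
        {
            sw.WriteLine(new string(soFlow, 0, posn));
            posn = 0;
        }
    }
    // Write the final, shorter chunk, if there is one.
    if (posn > 0) sw.WriteLine(new string(soFlow, 0, posn));
}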

Spec: Read(Char[], Int32, Int32) Reads a specified maximum of characters from the current stream into a buffer, beginning at the specified index.

Certainly worked for me anyway :)
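
One caveat with counting 100 chars: a C# char is a UTF-16 code unit, and some characters (certain rarer Chinese and Japanese ideographs, emoji) take two chars, a surrogate pair. A fixed 100-char cut can therefore land in the middle of such a character. The sketch below shows one possible way to avoid that; the names SurrogateSafeChunks and ReadChunks are invented for illustration and are not from the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SurrogateSafeChunks
{
    // Hypothetical helper: yields chunks of at most chunkSize chars and never
    // ends a chunk between the two halves of a surrogate pair.
    public static IEnumerable<string> ReadChunks(TextReader reader, int chunkSize = 100)
    {
        if (chunkSize < 2) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var buffer = new char[chunkSize];
        int carried = 0; // 1 if a high surrogate was held back from the previous chunk
        while (true)
        {
            int wanted = chunkSize - carried;
            int read = reader.ReadBlock(buffer, carried, wanted);
            int total = carried + read;
            if (total == 0) yield break;

            bool atEnd = read < wanted;
            int take = total;
            // Hold back a trailing high surrogate so that it is written together
            // with its low surrogate at the start of the next chunk.
            if (!atEnd && char.IsHighSurrogate(buffer[total - 1])) take = total - 1;

            yield return new string(buffer, 0, take);

            carried = total - take;
            if (carried == 1) buffer[0] = buffer[total - 1];
        }
    }
}
```

Each returned chunk can be passed to sw.WriteLine as before. A chunk is one char shorter than chunkSize whenever a pair would otherwise have been split.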
