I have a problem: I have a large text file in ANSI encoding.

I read it line by line with this function:

private static IEnumerable<string> ReadLineFromFile(TextReader fileReader)
{
    using (fileReader)
    {
        string currentLine;
        while ((currentLine = fileReader.ReadLine()) != null)
        {
            yield return currentLine;
        }
    }
}


public void go()
{
    while (true)
    {
        TextReader readFile = new StreamReader(file_path);
        foreach (string line in ReadLineFromFile(readFile))
        {
        }
    }
}

How can I convert all the ANSI lines to UTF-8? Thanks.

Chris Catignani
obdgy

3 Answers


Try using Encoding.UTF8.GetBytes() (in the System.Text namespace) to get the UTF-8 bytes for a string. In .NET, all strings are held in memory as UTF-16, so there is no such thing as a UTF-8 string at runtime. An encoding only applies to bytes: use GetBytes() to turn a string into encoded bytes, and GetString() to turn encoded bytes back into a string.
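As a minimal sketch of that point (the class and method names here are hypothetical, chosen only for illustration), the following shows one string becoming a UTF-8 byte array:

```csharp
using System;
using System.Text;

static class Utf8BytesDemo
{
    // Hypothetical helper: returns the UTF-8 encoded bytes of a .NET (UTF-16) string.
    public static byte[] ToUtf8(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    static void Main()
    {
        // "caf\u00E9" is "café"; the escape avoids depending on the source file's encoding.
        byte[] utf8 = ToUtf8("caf\u00E9");

        // 'é' (U+00E9) takes two bytes in UTF-8: 0xC3 0xA9.
        Console.WriteLine(utf8.Length);                 // prints 5
        Console.WriteLine(BitConverter.ToString(utf8)); // prints 63-61-66-C3-A9
    }
}
```

The string itself never "becomes UTF-8"; only the byte array returned by GetBytes() is UTF-8 encoded.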

Michael Gunter

First read the bytes from the file, then call Encoding.GetEncoding(1252) to get the ANSI encoding (the code page may vary; 1252 is the Western European one). Use its GetString method to decode the bytes into a .NET string, which you can then re-encode into any other encoding.

Try something like this:

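private IEnumerable<string> ReadLineFromFile(string path)
{
    // Loads the whole file into memory, so this only suits files that fit in RAM.
    byte[] ansiEncodedBytes = File.ReadAllBytes(path);
    Encoding ansi = Encoding.GetEncoding(1252); // Windows-1252; adjust to your code page
    string utf16string = ansi.GetString(ansiEncodedBytes);
    // Split on both Windows and Unix line endings so no '\r' is left on the lines.
    return utf16string.Split(new[] { "\r\n", "\n" }, StringSplitOptions.None);
}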
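Since the file in the question is large, a streaming variant may be preferable to reading every byte up front. The sketch below assumes the file is Windows-1252; AnsiLineReader and ReadAnsiLines are hypothetical names, not part of any library:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class AnsiLineReader
{
    // Hypothetical helper: decodes an ANSI file lazily, one line at a time.
    public static IEnumerable<string> ReadAnsiLines(string path)
    {
        // Assumption: the file uses code page 1252. On .NET Core / .NET 5+ this call
        // throws unless Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
        // has been called first.
        Encoding ansi = Encoding.GetEncoding(1252);

        // Passing the encoding to StreamReader makes ReadLine() decode the bytes correctly.
        using (var reader = new StreamReader(path, ansi))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}
```

The only real change from the question's code is the second argument to StreamReader; without it, StreamReader assumes UTF-8 and non-ASCII ANSI bytes are decoded incorrectly.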
Lorentz Vedeler

If you are using .NET 4 or later, you can use the File.ReadLines(string path, Encoding encoding) method.

This reads the file lazily, line by line, like your ReadLineFromFile() method, and the Encoding parameter lets you specify Encoding.Default. On .NET Framework, that tells it to use the operating system's current ANSI code page when reading the text. On .NET Core and .NET 5 or later, Encoding.Default is UTF-8, so pass the code page explicitly there, for example Encoding.GetEncoding(1252).

Note that the strings are converted from ANSI to UTF-16, because string in C# is stored as UTF-16.

So you could rewrite your go() test method like so:

using System.IO;
using System.Text;

...

public void go()
{
    while (true)
    {
        foreach (string line in File.ReadLines(file_path, Encoding.Default))
        {
        }
    }
}
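If the goal is a UTF-8 copy of the file rather than just correctly decoded strings, the lines can be written back out with a UTF-8 encoding. This is a rough sketch under the assumption that the source is Windows-1252; AnsiToUtf8 and ConvertFile are hypothetical names:

```csharp
using System.IO;
using System.Text;

static class AnsiToUtf8
{
    // Hypothetical helper: re-encodes a text file from ANSI to UTF-8.
    // utf8Path must differ from ansiPath, because the source is still
    // being read while the target is written.
    public static void ConvertFile(string ansiPath, string utf8Path)
    {
        // Assumption: code page 1252. On .NET Core / .NET 5+ register
        // CodePagesEncodingProvider.Instance before calling GetEncoding.
        Encoding ansi = Encoding.GetEncoding(1252);

        // UTF8Encoding(false) writes UTF-8 without a byte order mark.
        Encoding utf8NoBom = new UTF8Encoding(false);

        // File.ReadLines is lazy, so the whole file is never held in memory.
        File.WriteAllLines(utf8Path, File.ReadLines(ansiPath, ansi), utf8NoBom);
    }
}
```

File.WriteAllLines ends every line with Environment.NewLine, so the output's line endings may differ from the input's.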
Matthew Watson