
In .NET Core, I have huge text files that need to be converted from Unix (LF) to Windows (CRLF) line endings.

Since I can't load the file completely into memory (the files are too big), I read the bytes one at a time, and whenever I encounter an LF I output CR+LF. This process works, but it takes a long time for huge files. Is there a more efficient way to do this?
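The loop described above, done with buffered reads and writes instead of single bytes, might look roughly like the sketch below. It assumes the input is in an ASCII-compatible encoding (UTF-8 or a single-byte code page) where the byte 0x0A can only mean LF, and that the input uses bare LF endings with no pre-existing CR+LF pairs; a UTF-16 file would be mangled by this approach.

```csharp
using System;
using System.IO;

// Sketch: convert LF -> CR+LF by scanning fixed-size buffers instead of
// single bytes. Assumes an ASCII-compatible encoding (this would break
// on UTF-16) and that the input has no existing CR+LF pairs.
static void ConvertLfToCrLf(string inputPath, string outputPath)
{
    const byte LF = 0x0A;
    const byte CR = 0x0D;

    using var input = File.OpenRead(inputPath);
    using var output = File.Create(outputPath);

    var buffer = new byte[64 * 1024]; // 64 KB chunks
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        int start = 0;
        for (int i = 0; i < read; i++)
        {
            if (buffer[i] == LF)
            {
                // Flush everything up to (not including) the LF,
                // then emit CR+LF in its place.
                output.Write(buffer, start, i - start);
                output.WriteByte(CR);
                output.WriteByte(LF);
                start = i + 1;
            }
        }
        // Write whatever remains after the last LF in this chunk.
        output.Write(buffer, start, read - start);
    }
}
```

Because this never interprets the bytes as text, it sidesteps the unknown-encoding problem entirely (for ASCII-compatible encodings) and avoids per-byte I/O overhead.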

I thought about using a StreamReader, but the problem I'm having is that we don't know the source file's encoding.

Any idea?

Thank you

user1861857
  • hard to say how to increase the efficiency without seeing how you're reading the bytes. Most efficient is likely reading the file into a buffer of some reasonable size and writing directly from that buffer to the new file - modifying whatever you need along the way. You might get better with a memory mapped file and spans but I don't know if the increase would be meaningful in your context. But I also don't know how you're modifying content without knowing the encoding. Seems impossible. – MikeJ Aug 06 '20 at 16:08

1 Answer


Without knowing more about the specific files you're trying to process, I'd probably start off with something like the below and see if that gets me the results I want.

Depending on the specifics of your situation you may be able to do something more efficient, but if you're handling truly large datasets with unstructured text then it's usually a matter of throwing more powerful hardware at the problem if speed is still an issue.

You don't have to specify the Encoding to make use of the StreamReader class. Was there a specific problem with the reader you encountered?

using System.IO;

const string inputFilePath = "";  // path to the Unix (LF) source file
const string outputFilePath = ""; // path for the Windows (CRLF) copy

using var sr = new StreamReader(inputFilePath);
using var sw = new StreamWriter(outputFilePath);

string line;

// Buffers each line into memory, but not the newline characters.
while ((line = await sr.ReadLineAsync()) != null)
{
    // Write the contents of the string out to the "fixed" file (manually
    // specifying the line ending you want).
    await sw.WriteAsync(line + "\r\n");
}
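If preserving the source encoding does turn out to matter, one possible variant (a sketch, not guaranteed for every input) is to let StreamReader's BOM detection choose the encoding and hand that same encoding to the writer. Note this only works for BOM-prefixed files; anything without a BOM falls back to UTF-8, and `CurrentEncoding` is only reliable after the reader has actually touched the stream.

```csharp
using System.IO;
using System.Threading.Tasks;

// Hypothetical variant that reuses the reader's BOM-detected encoding
// for the writer. BOM-less files still fall back to UTF-8.
static async Task ConvertPreservingEncodingAsync(string inputPath, string outputPath)
{
    using var sr = new StreamReader(inputPath, detectEncodingFromByteOrderMarks: true);
    sr.Peek(); // force the reader to inspect the BOM before we ask for CurrentEncoding

    using var sw = new StreamWriter(outputPath, append: false, encoding: sr.CurrentEncoding);

    string line;
    while ((line = await sr.ReadLineAsync()) != null)
    {
        await sw.WriteAsync(line + "\r\n");
    }
}
```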
Michael
  • The problem I'm having with this is the StreamWriter. If I don't specify an output encoding, the StreamWriter uses UTF8 by default. The thing is, the input file could have another encoding... – user1861857 Aug 06 '20 at 15:42
  • 1
    OK I see what you mean now. So there's two options here. If you *have* to preserve the encoding of the original file then you'll find a good discussion on how to get the encoding from any file [in this answer](https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding). You can then set this value as one of the constructor parameters on the StreamWriter class. But I would also consider if this is necessary? Does using UTF8 cause you a problem later on? – Michael Aug 06 '20 at 15:50