I'm looking to load a 150 MB text file into a string. The file is UTF16 encoded, so it will produce a string that's about 150 MB in memory. All the methods I have tried result in an Out of Memory exception.

I know this is a huge string, and certainly not how I'd like to do things. But there's really not much I can do about that at the moment without a lot of really deep changes to an application about to head out the door. The file does not have an evenly distributed set of lines in it. One line can contain 80% or so of the entire file size.

Here's what I've tried:

Method 1

// Both of these throw an Out of Memory exception
var s1 = File.ReadAllText(path);
var s2 = File.ReadAllText(path, Encoding.Unicode);

Method 2

var sb = new StringBuilder();

// I've also tried a few other iterations on this with other types of streams
using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
  string line;
  while ((line = sr.ReadLine()) != null)
  {
    sb.AppendLine(line);
  }
}

// This throws an exception
sb.ToString();

Method 3

using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (StreamReader sr = new StreamReader(fs, Encoding.Unicode))
{
  int initialSize = (int)(fs.Length / 2);  // Comes to a value of 73285158 with my test file
  var sb = new StringBuilder(initialSize); // This throws an exception

  string line;
  while ((line = sr.ReadLine()) != null)
  {
    sb.AppendLine(line);
  }

  sb.ToString();
}

So, what can I do to load this file into a string variable?

Edit: Added additional attempts to resolve issue based on comments.

Nathan
  • [C# MSDN File.ReadAllText Method (string, Encoding)](https://msdn.microsoft.com/en-us/library/ms143369%28v=vs.110%29.aspx) – MethodMan Feb 05 '15 at 17:45
  • As an additional measure, you probably want to force x64 instead of AnyCPU/Prefer x86... Memory fragmentation in a 32-bit process is a common reason for large allocations to fail. – Alexei Levenkov Feb 05 '15 at 17:47
  • Unfortunately this has to stay a 32 bit application. – Nathan Feb 05 '15 at 17:55
  • A cheap way to help: http://stackoverflow.com/questions/14186256/net-out-of-memory-exception-used-1-3gb-but-have-16gb-installed I've used it for images, but it is better to solve the core issue. – R Quijano Feb 05 '15 at 18:12
  • If you're dealing with limited resources, your best approach might be to open a FileStream and just iterate through it with a stream reader. Sort of fits the "how do you eat an elephant? one bite at a time" principle. (Pun intended) – code4life Feb 05 '15 at 18:23

2 Answers

5

Both of your attempts so far are treating the file as if it were in UTF-8. In the best case, that's going to take twice as much memory - and it's very likely to be invalid data (as UTF-8), basically. You should try specifying the encoding:

var text = File.ReadAllText(path, Encoding.Unicode);

If that doesn't work, you could try a variant on your second code, but specifying the encoding to StreamReader (and probably ignoring the BufferedStream - I don't think it'll help you here), and also specifying an initial capacity for the StringBuilder, equal to half the size of the file.

EDIT: If this line is throwing an exception:

var sb = new StringBuilder(initialSize);

... then you don't have a chance. You are unable to allocate enough contiguous memory.

You may find that you're able to use a List<string> instead:

var lines = File.ReadLines(path, Encoding.Unicode).ToList();

... in that at least you've then got lots of little objects. It will take more memory, but it won't require as much contiguous memory. That's assuming you really need the whole file in memory at a time. If you can possibly stream the data instead, that would be a much better bet.
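
If streaming is an option, one possible shape for it is sketched below. Because a single line may hold most of the file, this reads fixed-size blocks of characters rather than lines. `ChunkedReader`, `ProcessInChunks`, and the `handle` callback are hypothetical names for illustration, and the 32K-character buffer is an assumption chosen to keep the buffer small.

```csharp
using System;
using System.IO;
using System.Text;

static class ChunkedReader
{
    // Hypothetical helper: decodes a UTF-16 file and passes it to `handle`
    // in blocks of at most `bufferChars` characters. Returns the number of
    // characters read. Only one small buffer is alive at any time.
    public static long ProcessInChunks(string path, Action<char[], int> handle,
                                       int bufferChars = 32 * 1024)
    {
        long total = 0;
        var buffer = new char[bufferChars];

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (var sr = new StreamReader(fs, Encoding.Unicode))
        {
            int read;
            // Read returns 0 at end of file; a block may be shorter than the buffer.
            while ((read = sr.Read(buffer, 0, buffer.Length)) > 0)
            {
                handle(buffer, read); // only the first `read` chars are valid
                total += read;
            }
        }

        return total;
    }
}
```

A caller would pass something like `(buf, count) => parser.Feed(buf, 0, count)`, where `parser` is whatever consumes the text. Note that a block boundary can fall in the middle of a surrogate pair or a logical token, so the consumer has to cope with that.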

In a small console app I'm able to read a file of the same size with no problem using File.ReadAllText, with both the 32-bit and 64-bit CLR... so it may be a matter of your physical memory and what else you're doing in the program.

Jon Skeet
  • Interesting - so are you saying loading a 150 mb string in memory is OK as long as there's 150 megs available in RAM? Is there an upper limit to this - will .NET allow you to store any size in a string as long as there's memory? – Jochen van Wylick Feb 05 '15 at 17:49
  • @spike: Well, there are a few things to consider. It's likely to *temporarily* take more space than that - if you're unlucky, you might need nearly 3x as much memory, as if you run out of space in the buffer shortly before finishing reading data, it'll allocate a buffer that's twice as big - and then when you call `ToString` it'll copy the relevant part of that buffer into a new string. Then there's a 2GB per-object limit for the 64-bit CLR, or 1GB per-object in the 32-bit CLR if I remember correctly (but the details may be slightly off). – Jon Skeet Feb 05 '15 at 17:51
  • @spike: You should be able to load a 150MB file as a string, but it's not necessarily a good idea... (And use 64-bit CLR if you can, as per Alexei's comment.) – Jon Skeet Feb 05 '15 at 17:51
  • Thanks! Sure, I'll try to avoid it ;) I was just wondering where the limit is and why. – Jochen van Wylick Feb 05 '15 at 17:55
  • @Jon Skeet: I tried both of your suggestions. The ReadAllText version gives me the Out of Memory error. Instantiating the StringBuilder with 1/2 the size of the file also throws one. If I leave off setting the initial capacity, then I get the error where I had in my 2nd method in my post. – Nathan Feb 05 '15 at 17:57
  • @Grandpappy: Okay, will generate my own file and try to reproduce. – Jon Skeet Feb 05 '15 at 18:11
  • @Grandpappy: Have edited my answer. We could really do with a bit more information about the physical system. – Jon Skeet Feb 05 '15 at 18:16
  • My machine has 16GB of ram, and I'm using about 70% of it right now. I've got a 64 bit OS, but I can't guarantee that client will, so the app has to stay 32 bit. I've been testing this in LINQPad, and the code we've been working with is the only code in the application. So I guess it's just having a problem with contiguous memory? – Nathan Feb 05 '15 at 18:52
  • @Grandpappy: Hard to tell - I would definitely try outside LINQPad if possible, as that may be adding more constraints. Do you really *have* to have all of this text in memory in one go? Have you tried using a `List<string>` instead? – Jon Skeet Feb 05 '15 at 19:06
  • Unfortunately I do have to have it all in one string. Currently the rest of the application expects all this to be in a single string, and right now I'm supposed to just deal with this portion of the problem. Until recently this string was much smaller, but recent changes to how a client uses the application has caused this string to grow many times its original size. – Nathan Feb 05 '15 at 19:23
  • I did try this outside of LINQPad, and you were correct, it was causing some of my problems. The original issue stemmed from the real application, but I moved that code into LINQPad to try and test solutions. I'll be more careful of that in the future with large memory problems. – Nathan Feb 05 '15 at 19:43
  • I am reading a .sql script file which is 150 MB. Both [fileInfo.OpenText().ReadToEnd()](http://stackoverflow.com/a/40830/2218697) and [File.ReadAllText](http://stackoverflow.com/a/40830/2218697) crash Visual Studio because the file is large. Any solution? – Shaiju T Dec 06 '15 at 13:56
  • @stom: This is the second of my answers that you've commented on about this. Please ask a new question. – Jon Skeet Dec 06 '15 at 14:57
  • Sorry for the comments, and thank you for the advice; I will keep it in mind. I posted the question [here](http://stackoverflow.com/q/34149904/2218697). – Shaiju T Dec 08 '15 at 07:07
0

I am also trying to find a way to load a file into a string with minimal memory usage during loading. But all the methods I have seen use a StringBuilder under the hood: first everything is collected in the StringBuilder, and then StringBuilder.ToString() is called, so during the process all the characters are held in memory twice.

That means that for UTF-16 files, peak memory usage is about (file size) * 2, and for UTF-8 files (assuming all the text is ASCII) it is about (file size) * 4. In the end, of course, (string length) * 2 bytes of memory remain in use.

  • StreamReader.ReadToEnd() - uses StringBuilder
  • File.ReadAllText() - uses StreamReader.ReadToEnd() - uses StringBuilder
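
One way to avoid the intermediate StringBuilder might be to read the file's bytes straight into the string's own memory. The sketch below is not what the methods listed above do, and it rests on several assumptions: it needs `string.Create` and the span-based `Stream.Read`, which exist in .NET Core 2.1+ / .NET 5+ but not in .NET Framework; the file must be UTF-16 little-endian (with or without a byte order mark); and the machine must be little-endian. `DirectStringLoader` and `ReadUtf16` are hypothetical names.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

static class DirectStringLoader
{
    // Hypothetical helper. Assumes .NET Core 2.1+ / .NET 5+ and a UTF-16 LE file.
    // The bytes are copied as-is; no validation of the text is performed.
    public static string ReadUtf16(string path)
    {
        if (!BitConverter.IsLittleEndian)
            throw new NotSupportedException("Sketch assumes a little-endian machine.");

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            // Skip a UTF-16 LE byte order mark (FF FE) if one is present.
            long dataBytes = fs.Length;
            int b0 = fs.ReadByte();
            int b1 = fs.ReadByte();
            if (b0 == 0xFF && b1 == 0xFE)
                dataBytes -= 2;
            else
                fs.Position = 0;

            int charCount = checked((int)(dataBytes / 2));

            // string.Create allocates the final string once and lets the callback fill it.
            return string.Create(charCount, fs, (chars, stream) =>
            {
                Span<byte> dest = MemoryMarshal.AsBytes(chars);
                while (dest.Length > 0)
                {
                    int n = stream.Read(dest);
                    if (n == 0)
                        throw new EndOfStreamException("File ended before the string was filled.");
                    dest = dest.Slice(n);
                }
            });
        }
    }
}
```

Under these assumptions, peak usage should stay close to (string length) * 2 bytes. The string itself is still a single contiguous allocation, so this does not help when that allocation is what fails.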
nzeemin