I have a very basic question. I have several text files with data, each several GB in size. I have a C# WPF application which I use to process similar data files, but nowhere close to that size (probably around 200-300 MB right now). How can I efficiently read this data and then write it somewhere else after processing, without everything freezing and crashing? Essentially, what's the best way to read from a very large file? For my low-scale application right now, I use System.IO.File.ReadAllLines to read and a StreamWriter to write. I'm sure those two methods are not the best idea for such large files. I don't have much experience with C#; any help will be appreciated!

- Use [Streams](http://stackoverflow.com/questions/2161895/reading-large-text-files-with-streams-in-c-sharp) – cbr Jun 16 '15 at 17:03
- Consider using [multithreading](https://msdn.microsoft.com/en-us/library/ck8bc5c6.aspx) for any kind of sizable background task. That way you could read a 100 GB file with a basic reader without it freezing your program. – FreshWaterTaffy Jun 16 '15 at 17:07
- You are currently using 'ReadAllLines'. Does your file processing logic require you to look at all the lines "at the same time", or can you process line by line? – Mike Goodwin Jun 16 '15 at 17:07
- @Icemanind Okay, well, that was just an example; the point is that if you're doing something that takes a significant amount of time, it may be worth creating a background task to handle it so your application doesn't freeze. – FreshWaterTaffy Jun 16 '15 at 17:10
- @MikeGoodwin I can read line by line. – sparta93 Jun 16 '15 at 17:12
- The question is whether you can process line by line, not read line by line. Once processed, can you output line by line? – paparazzo Jun 16 '15 at 17:16
- @Blam Yes, can output line by line. – sparta93 Jun 16 '15 at 17:17
- Consult [Extremely Large Single-Line File Parse](http://stackoverflow.com/questions/26247952/). – Dour High Arch Jun 16 '15 at 17:37
2 Answers
If you can do this line by line then the answer is simple (a sketch follows the list):
- Read a line.
- Process the line.
- Write the line.
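A minimal sketch of that loop with a StreamReader/StreamWriter pair; the file paths and the Process method are placeholders for your own:

```csharp
using System.IO;

// Sketch: stream one line at a time so memory use stays flat no matter
// how large the file is (unlike File.ReadAllLines, which loads it all).
using (var reader = new StreamReader(@"C:\data\input.txt"))
using (var writer = new StreamWriter(@"C:\data\output.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string processed = Process(line); // Process() is a placeholder for your logic
        writer.WriteLine(processed);
    }
}
```

Because only one line is in memory at a time, this runs in roughly constant memory regardless of file size.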
If you want it to go a bit faster, run those three steps as concurrent stages connected by BlockingCollections with a bounded capacity of something like 10, so a fast stage can only run a bounded distance ahead of a slow one. It also helps if you can write the output to a different physical disc than the input (assuming the output goes to disc at all); a sketch follows.
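One way to wire that up (a sketch; Process is again a placeholder, as are the paths), with two bounded BlockingCollections connecting the three stages:

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// Sketch: three stages connected by two bounded queues. The bound (10)
// keeps a fast stage from running unboundedly ahead of a slow one.
var readQueue  = new BlockingCollection<string>(boundedCapacity: 10);
var writeQueue = new BlockingCollection<string>(boundedCapacity: 10);

var readTask = Task.Run(() =>
{
    foreach (var line in File.ReadLines(@"C:\data\input.txt"))
        readQueue.Add(line);           // blocks when the queue is full
    readQueue.CompleteAdding();
});

var processTask = Task.Run(() =>
{
    foreach (var line in readQueue.GetConsumingEnumerable())
        writeQueue.Add(Process(line)); // Process() is a placeholder
    writeQueue.CompleteAdding();
});

var writeTask = Task.Run(() =>
{
    using (var writer = new StreamWriter(@"C:\data\output.txt"))
        foreach (var line in writeQueue.GetConsumingEnumerable())
            writer.WriteLine(line);
});

Task.WaitAll(readTask, processTask, writeTask);
```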
OP changed the rules even after being asked (twice) if the process was line by line. The revised pipeline:
- Read line(s) to build a unit of work (open tag to close tag); see the sketch after this list.
- Process the unit of work.
- Write the unit of work.
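A sketch of the reading step; openTag and closeTag are hypothetical parameters standing in for whatever markers the file actually uses:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Sketch: accumulate lines from an open tag to its close tag into one
// unit of work. Handles both tags on the same line as well as tags that
// span multiple lines. openTag/closeTag are hypothetical placeholders.
static IEnumerable<string> ReadUnits(string path, string openTag, string closeTag)
{
    var unit = new StringBuilder();
    bool inUnit = false;
    foreach (var line in File.ReadLines(path))
    {
        if (!inUnit && line.Contains(openTag))
            inUnit = true;
        if (inUnit)
            unit.AppendLine(line);
        if (inUnit && line.Contains(closeTag))
        {
            yield return unit.ToString(); // one complete unit of work
            unit.Clear();
            inUnit = false;
        }
    }
}
```

Each string this yields is one complete unit of work that can then be processed and written line by line.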
- In the input txt file, my program looks for open and close tags while reading. Sometimes they are both on the same line and that's that, but sometimes they span multiple lines. Sorry if I confused you. So yes, the input cannot be read exactly line by line, but the output has to be line by line. – sparta93 Jun 16 '15 at 17:21
- I specifically asked if the *processing* was line by line. Then just change "read a line" to "read lines" to form a unit of work that you send on to be processed. – paparazzo Jun 16 '15 at 17:25
- @sparta93 Just implement a state machine, and you *still* can do it line by line (on-the-fly parsing). If you're reading XML, take a look at SAX. – Lucas Trzesniewski Jun 16 '15 at 17:30
This might be an overlapped transformation of some kind.
https://msdn.microsoft.com/en-us/library/dd997372(v=vs.110).aspx
First, you'll want to allocate your destination file as close to the estimated result size as you can. Overshooting is usually preferable to undershooting: you can always truncate to a given length afterwards, whereas growing the file may force non-contiguous allocation. If excessive growth is expected, you may be able to allocate the file as a "sparse" file.
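A minimal sketch of the pre-allocation, assuming (purely as a placeholder estimate) that the output will be roughly the size of the input:

```csharp
using System.IO;

// Sketch: reserve the destination space up front. Using the input size
// as the estimate is an assumption; substitute your own estimate.
long estimate = new FileInfo(@"C:\data\input.txt").Length;
using (var fs = new FileStream(@"C:\data\output.txt", FileMode.Create, FileAccess.Write))
{
    fs.SetLength(estimate); // truncate later with SetLength(actualBytes) if you overshot
}
```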
Pick a block size of at least 512 bytes, ideally a power of two; test to find the best-performing value.
Map 2 blocks of the source file. This is your source buffer.
Map 2 blocks of the destination file. This is your destination buffer.
Operate on the lines within a block. Read from your source block, write to your destination block.
Once you cross a block boundary, perform a "buffer swap": release the completed block and map the next one in its place.
There are several ways to accomplish these tasks.
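One of them, sketched with MemoryMappedFile; the straight block-for-block copy is an assumption standing in for your real per-block transformation, and only works as-is if that transformation preserves offsets:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Sketch: map one block of source and destination at a time; the "buffer
// swap" is simply disposing the finished views and mapping the next pair.
// The 1:1 offset copy is an assumption; replace CopyTo with your transform.
const int BlockSize = 1 << 20; // 1 MB; benchmark to tune

long length = new FileInfo(@"C:\data\input.txt").Length;
using (var src = MemoryMappedFile.CreateFromFile(@"C:\data\input.txt", FileMode.Open))
using (var dst = MemoryMappedFile.CreateFromFile(@"C:\data\output.txt", FileMode.Create,
                                                 null, length)) // pre-sized
{
    for (long offset = 0; offset < length; offset += BlockSize)
    {
        long size = Math.Min(BlockSize, length - offset);
        using (var srcView = src.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
        using (var dstView = dst.CreateViewStream(offset, size, MemoryMappedFileAccess.Write))
        {
            srcView.CopyTo(dstView); // per-block work goes here
        }
    }
}
```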
If you wish, you may allocate more blocks at a time, though you'll then need a "triple buffering" strategy of overlapped operation to take advantage of them; a sketch of the overlap follows. If writes are far slower than reads, you may even implement unbounded memory buffering with the same pattern as triple buffering.
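A sketch of the overlap idea with two buffers (double buffering); adding a third buffer would also decouple a processing stage between the read and the write:

```csharp
using System.IO;
using System.Threading.Tasks;

// Sketch: double buffering. While block N is being written, block N+1 is
// already being read. A third buffer would also decouple processing.
static async Task CopyOverlappedAsync(string src, string dst, int blockSize)
{
    using (var input = new FileStream(src, FileMode.Open, FileAccess.Read,
                                      FileShare.Read, blockSize, useAsync: true))
    using (var output = new FileStream(dst, FileMode.Create, FileAccess.Write,
                                       FileShare.None, blockSize, useAsync: true))
    {
        var front = new byte[blockSize];
        var back  = new byte[blockSize];

        int count = await input.ReadAsync(front, 0, blockSize);
        while (count > 0)
        {
            Task<int> nextRead = input.ReadAsync(back, 0, blockSize); // overlap the next read
            await output.WriteAsync(front, 0, count);
            count = await nextRead;

            var tmp = front; front = back; back = tmp; // buffer swap
        }
    }
}
```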
Depending on your data, you may also be able to distribute blocks to separate threads, even if it's a "line based" file.
If every line is dependent on previous data, there may be no way to accelerate the operation. If not, indexing the lines in the file prior to performing operations will allow multiple worker threads, each operating on independent blocks.
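If the lines really are independent, an index-then-partition approach might look like this sketch (the paths and the per-line work are placeholders):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Sketch: pass 1 records where every line starts; pass 2 hands disjoint
// ranges of lines to parallel workers, each with its own FileStream.
static long[] IndexLineStarts(string path)
{
    var starts = new List<long> { 0 };
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, 1 << 16))
    {
        long len = fs.Length, pos = 0;
        int b;
        while ((b = fs.ReadByte()) != -1)
        {
            pos++;
            if (b == '\n' && pos < len)
                starts.Add(pos);
        }
    }
    return starts.ToArray();
}

static void ProcessInParallel(string path)
{
    long[] starts = IndexLineStarts(path);
    Parallel.ForEach(Partitioner.Create(0, starts.Length), range =>
    {
        using (var fs = File.OpenRead(path))
        {
            fs.Seek(starts[range.Item1], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs))
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    string line = reader.ReadLine();
                    // process line i independently here (placeholder)
                }
            }
        }
    });
}
```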
If I need to elaborate on anything, just say what part.
