11

I have a .txt file (more than a million rows, around 1 GB) and a list of strings. I am trying to remove every row from the file that exists in the list of strings and write the remaining rows to a new file, but it is taking a very long time.

using (StreamReader reader = new StreamReader(_inputFileName))
{
    using (StreamWriter writer = new StreamWriter(_outputFileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!_lstLineToRemove.Contains(line))
                writer.WriteLine(line);
        }
    }
}

How can I enhance the performance of my code?

James Z
  • 12,209
  • 10
  • 24
  • 44
Rocky
  • 4,454
  • 14
  • 64
  • 119
  • have you tried appending to file ? http://stackoverflow.com/questions/7306214/append-lines-to-a-file-using-a-streamwriter – Igoris Azanovas Apr 20 '16 at 12:48
  • 13
    One simple way would be to convert `_lstLineToRemove` from `List` to `HashSet` (assuming it's not hash set already) – ghord Apr 20 '16 at 12:48
  • What do you mean by *long long time* ? Is there a lot of short line or a few long lines ? – Thomas Ayoub Apr 20 '16 at 12:49
  • long time means more than 15 min. row will be something like this A21933,67,RN3W,2007,12,10Ø00.000 ,0.000 ,0.000 00000,0000#000876053,67,JITK,2007,12,1000.000 ,0.000 ,0.000 – Rocky Apr 20 '16 at 12:50
  • This may help. It talks about using BufferedStream: http://stackoverflow.com/questions/2161895/reading-large-text-files-with-streams-in-c-sharp – sr28 Apr 20 '16 at 12:52
  • And how many lines are in `lstLineToRemove` ? – Thomas Ayoub Apr 20 '16 at 12:52
  • Better move this task to a background worker and leave the user interface free to respond to user commands. Of course you should check if your copy has ended before closing your application – Steve Apr 20 '16 at 12:52
  • @Thomas: it could be 1000 or 100000 or 10 – Rocky Apr 20 '16 at 12:52
  • Define efficient. And first get rid of that indian "lac" - name it in english if you speak english. Now, performance - you talk of a 2gb operation here. On a hard disc that will take time. On a normal SSD that will take time. So, define "along long". Hours? Days? – TomTom Apr 20 '16 at 12:54
  • @TomTom - the OP has defined "a long long time" in the comments above. It's more than 15mins. – sr28 Apr 20 '16 at 12:57
  • you could try using blocking collection and implement producer | consumer pattern. – TLJ Apr 20 '16 at 12:58
  • I bet line order is important ? – Thomas Ayoub Apr 20 '16 at 12:58
  • More on BufferedStream: https://msdn.microsoft.com/en-us/library/system.io.bufferedstream(v=vs.110).aspx – sr28 Apr 20 '16 at 12:59
  • That does not sound so bad for using no buffers and likely a laptop or something like that. – TomTom Apr 20 '16 at 13:03
  • Seems to be an I/O problem. Have you tried it on a SSD or RAM-Disk? Have you ran a performance analysis? – Oliver Apr 20 '16 at 13:07
  • 1
    The speed is roughly 1Mb/s, this seems pretty slow. From where you read and to where you write? HDD? SSD? Flash? When reading and writing from same physical drive speed is reduced. What if you remove check and let it write all lines? How fast it will be? If it's same 15min, then bottleneck is file system. If severely less, then there is a way to optimize algorithm. – Sinatr Apr 20 '16 at 13:08
  • 1
    Another guess would be to replace your `List` with a `HashSet`. – Oliver Apr 20 '16 at 13:09
  • 1
    You won't really be able to speed up the IO beyond what you have. But as others have pointed out, replacing `List` with `HashSet` is likely to help a lot, especially if `_lstLineToRemove` contains hundreds of lines. – Matthew Watson Apr 20 '16 at 13:12
  • @Rocky, could you run .Net Profiler and provides us with results. – Sergey L Apr 20 '16 at 13:18
  • is it a 32bits or 64bits application? how much ram the computer that will run this have? – Fredou Apr 20 '16 at 13:27
  • 2
    Question: How long does it take to simply COPY this file? you cannot get any faster than this benchmark, so please note that first. Then, tell me, what is the time your application takes? – DaFi4 Apr 20 '16 at 14:07
  • This sounds like log processing. The TPL Dataflow can really help in such situations. You could use an ActionBlock to decouple reading from writing and move writing to a different task. Or you can use a BatchBlock first to batch multiple lines, then use an ActionBlock to write multiple lines at once. – Panagiotis Kanavos Apr 20 '16 at 14:33
  • The greatest performance benefit would be to move the target file to a *different* drive. – Panagiotis Kanavos Apr 20 '16 at 14:33
  • @Panagiotis only under certain conditions on certain computers, its not a law. 1st determine problem, then suggest solution. I can think of many ways where your suggestions slow down things – DaFi4 Apr 20 '16 at 14:34
  • @montewhizdoh which one? Using two different spindles for the files? Actually this is kind of a law. Using the same spindle, you compete for reads and writes. Using different disks, you can read and write at each disk's full speed – Panagiotis Kanavos Apr 20 '16 at 14:41
  • @montewhizdoh as for using a different task to write, IO isn't a full-time operation due to OS-level buffering. Even for IO-bound tasks, there are breaks where you can perform different operations. That's why eg processing multiple image files in parallel is faster than processing them sequentially – Panagiotis Kanavos Apr 20 '16 at 14:43
  • @Panagiotis your suggestion is to move the file to a different drive. So if we take the file from "Drive A" to another drive, that is just vague. What if the other drive is a network share or seriously slower than the 1st drive? Or on a slow bus? Why would I want to use spindles? I prefer solid state drives. – DaFi4 Apr 20 '16 at 14:55
  • @Panagiotis as for your task suggestion, if a computer does not have extra processor capacity, or if the files are of the correct size, your code will run slower. Thats why its better to find the problem first before the suggestion. – DaFi4 Apr 20 '16 at 14:59
  • @montewhizdoh can you provide an example? As for processor capacity, even a single-core machine can read and write at the same time. File size would matter only for very small files. Here we know it's a 1 GB file that's processed at only 1MB/sec. For larger files, using only a specific amount of memory rather than trying to load everything at once, would be faster. – Panagiotis Kanavos Apr 20 '16 at 15:05
  • In fact, you could try this with `robocopy` which *does* allow multi-threaded file copying. – Panagiotis Kanavos Apr 20 '16 at 15:06
  • @Panagiotis I'm trying to help out here, please dont get upset. there is significant overhead in instantiating parallel operations and it is very noticeable when dealing with large files like this. Its noticeably faster in some cases because you get something for the investment, but If you instantiate a parallel operations when you don't have any processor available, its noticeably SLOWER as you just waste precious processor time without any benefit. my point is still that you should understand the problem before making a suggestion. – DaFi4 Apr 20 '16 at 15:13

4 Answers

4

You may get some speedup by using PLINQ to do the work in parallel, and switching from a list to a hash set will also greatly speed up the `Contains()` check. `HashSet<T>` is thread safe for read-only operations.

private HashSet<string> _hshLineToRemove;

void ProcessFiles()
{
    var inputLines = File.ReadLines(_inputFileName);
    var filteredInputLines = inputLines.AsParallel().AsOrdered().Where(line => !_hshLineToRemove.Contains(line));
    File.WriteAllLines(_outputFileName, filteredInputLines);
}

If it does not matter that the output file is in the same order as the input file, you can remove the `.AsOrdered()` call and get some additional speed.

Beyond this you are really just I/O bound; the only way to make it any faster is to run it on faster drives.
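
For completeness, here is a sketch of how `_hshLineToRemove` might be populated from the original `_lstLineToRemove` list. This is a guess at the setup, since the question does not show how the list is built:

// Build the lookup set once before processing. Contains() on a HashSet<string> is O(1),
// while Contains() on a List<string> scans the whole list for every input line.
// StringComparer.Ordinal matches the default exact string comparison used by List.Contains.
_hshLineToRemove = new HashSet<string>(_lstLineToRemove, StringComparer.Ordinal);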

Antony
  • 1,221
  • 1
  • 11
  • 18
Scott Chamberlain
  • 124,994
  • 33
  • 282
  • 431
  • This is no different than the other answer so the same comment applies: Loading the data is an IO operation with enough lag to allow filtering while reading each line. Loading everything in memory will only slow filtering dramatically and increase CPU usage due to looping. Log processing code does not load everything in memory – Panagiotis Kanavos Apr 20 '16 at 14:26
  • @PanagiotisKanavos This does not load everything into memory, I am using `ReadLines` not `ReadAllLines` like the other answer, so the processing is streamed instead of loaded into memory all at once. Also, I did post an update as you were typing that does address the I/O – Scott Chamberlain Apr 20 '16 at 14:27
  • It should be `=> !_hshLineToRemove.Contains(...` – CSharpie Apr 20 '16 at 14:29
  • I suspect *ReadLines* has a greater effect than PLINQ. In fact, I'd use TPL Dataflow to split reading from writing. I'd also move the target file to a *different* drive. I suspect this would improve performance more than multithreading. Only if that wasn't enough, I'd add a filtering block – Panagiotis Kanavos Apr 20 '16 at 14:34
  • @PanagiotisKanavos I originally was going to write a dataflow answer but thought it was too overkill for this problem. However if you wrote up a Dataflow implementation of this I would upvote it. – Scott Chamberlain Apr 20 '16 at 14:39
  • @ScottChamberlain 1 GB log file? It's probably the sweet spot. Eg, if the reader is too fast, it could flood the machine's memory. Throttling can help here. To reduce the number of writes, you can use BatchBlock. – Panagiotis Kanavos Apr 20 '16 at 14:40
  • it much much better in performance.. Thanks.. thanks a lot :) – Rocky Apr 22 '16 at 06:53
0

The code is particularly slow because the reader and writer never execute in parallel. Each has to wait for the other.

You can almost double the speed of file operations like this by having a reader thread and a writer thread. Put a BlockingCollection between them so you can communicate between the threads and limit how many rows you buffer in memory.

If the computation is really expensive (it isn't in your case), a third thread with another BlockingCollection doing the processing can help too.
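
A minimal sketch of such a pipeline, assuming a `HashSet<string>` for the lookup and a bounded `BlockingCollection<string>` so the reader cannot run arbitrarily far ahead of the writer. The method name, parameter names, and the capacity of 10,000 are placeholders, not values from the question:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

void FilterWithPipeline(string inputFile, string outputFile, HashSet<string> linesToRemove)
{
    // Bounded capacity limits how many rows are buffered in memory at once.
    using (var buffer = new BlockingCollection<string>(boundedCapacity: 10000))
    {
        // Producer: read and filter lines.
        var readTask = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(inputFile))
            {
                if (!linesToRemove.Contains(line))
                    buffer.Add(line);
            }
            buffer.CompleteAdding(); // tell the consumer no more lines are coming
        });

        // Consumer: write the surviving lines.
        var writeTask = Task.Run(() =>
        {
            using (var writer = new StreamWriter(outputFile))
            {
                foreach (var line in buffer.GetConsumingEnumerable())
                    writer.WriteLine(line);
            }
        });

        Task.WaitAll(readTask, writeTask);
    }
}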

Ian Mercer
  • 38,490
  • 8
  • 97
  • 133
  • That's already available out-of-the-box with Dataflow's ActionBlock. – Panagiotis Kanavos Apr 20 '16 at 14:36
  • 1
    For new development I honestly don't use `BlockingCollection` pipelines anymore, I have switched over to [TPL Dataflow](https://msdn.microsoft.com/en-us/library/hh228603(v=vs.110).aspx), it gives you the same process as a manual `BlockingCollection` based pipeline with separate `Task` for each stage of the work but it wraps it up it in to nice container classes so you don't need to deal with the collections or starting up the tasks. – Scott Chamberlain Apr 20 '16 at 14:36
  • @ScottChamberlain *and* it allows to to throttle the reader so it doesn't fill the buffer with all lines, if the writer is slow – Panagiotis Kanavos Apr 20 '16 at 14:37
0

Do not use buffered text routines. Use binary, unbuffered library routines and make your buffer size as big as possible. That's how to make it the fastest.
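
For illustration only, one way to apply the buffer-size part of this advice to the original loop is to give the streams explicitly large buffers and a sequential-scan hint. Note this still goes through text readers rather than raw binary I/O, and the 1 MB buffer size and UTF-8 encoding are arbitrary assumptions:

// BufferSize is an arbitrary choice; tune it for your hardware.
const int BufferSize = 1 << 20; // 1 MB

using (var inStream = new FileStream(_inputFileName, FileMode.Open, FileAccess.Read,
                                     FileShare.Read, BufferSize, FileOptions.SequentialScan))
using (var outStream = new FileStream(_outputFileName, FileMode.Create, FileAccess.Write,
                                      FileShare.None, BufferSize))
// Encoding.UTF8 is assumed here; use whatever encoding the file actually has.
using (var reader = new StreamReader(inStream, Encoding.UTF8, true, BufferSize))
using (var writer = new StreamWriter(outStream, Encoding.UTF8, BufferSize))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (!_lstLineToRemove.Contains(line))
            writer.WriteLine(line);
    }
}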

Jonathan Wood
  • 65,341
  • 71
  • 269
  • 466
0

Have you considered using AWK?

AWK is a very powerful tool for processing text files. You can find more information about how to filter lines that match certain criteria in Filter text with AWK.

MSE
  • 327
  • 3
  • 8