For a file with millions of lines, which API would be faster, and which can be parallelised?

`File.ReadLines` or `StreamReader.ReadLine`?
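
For reference, here is a minimal sketch of the two calls being compared. The class name `LineReading` and its method names are made up for illustration. Both methods stream the file line by line rather than loading it all into memory.

```csharp
using System;
using System.IO;

public static class LineReading
{
    // File.ReadLines returns a lazy IEnumerable<string>,
    // so lines are read as the loop advances.
    public static long CountWithReadLines(string path)
    {
        long count = 0;
        foreach (string line in File.ReadLines(path))
        {
            count++;
        }
        return count;
    }

    // The same loop written by hand: StreamReader.ReadLine
    // returns null once the end of the file is reached.
    public static long CountWithStreamReader(string path)
    {
        long count = 0;
        using (var reader = new StreamReader(path))
        {
            while (reader.ReadLine() != null)
            {
                count++;
            }
        }
        return count;
    }
}
```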

Falco Alexander
neelesh bodgal
  • 6
    They do the same thing. [`File.ReadLines` uses a `StreamReader` internally](https://referencesource.microsoft.com/#mscorlib/system/io/ReadLinesIterator.cs,e55db6d3fed9e8eb) – canton7 Jun 10 '20 at 12:15
  • So why do we have two APIs for the same thing? – neelesh bodgal Jun 10 '20 at 12:16
  • 4
    It's another level of abstraction. – Falco Alexander Jun 10 '20 at 12:17
  • Would parallelising either of these APIs yield the same performance, given a strategy such as a chunk or static partitioner? – neelesh bodgal Jun 10 '20 at 12:19
  • 2
    `StreamReader` is more powerful, but more verbose to use if you just want to iterate over all lines in a file. `File.ReadLines` just iterates over all lines in a file and none of the other stuff, but can do it in 1 line of code – canton7 Jun 10 '20 at 12:20
  • @canton7 Got it. Thanks for pointing me to the source code. – neelesh bodgal Jun 10 '20 at 12:22
  • What kind of parallelization do you have in mind? Do you want multiple threads reading from the same file concurrently? If so, are you targeting a specific type of data storage, or are you writing generic code that should run efficiently on any type of hardware (including hard disks with rotating platters)? – Theodor Zoulias Jun 10 '20 at 12:26
  • @TheodorZoulias I have to process website logs, which can run to millions of lines. The idea is to read a line or a chunk of lines and dispatch it to a thread (not manually) using the `Parallel` class. To do that, I wanted to compare the APIs beforehand. – neelesh bodgal Jun 10 '20 at 12:29
  • 2
    Then you need task-parallelism, not only data-parallelism: one thread reading data from the filesystem at maximum speed and storing it in memory, and multiple threads processing that data. Search for the producer-consumer pattern, and the [`BlockingCollection`](https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.blockingcollection-1) class or the [TPL Dataflow](https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/dataflow-task-parallel-library) library. – Theodor Zoulias Jun 10 '20 at 12:36
  • @TheodorZoulias TPL Dataflow seems to fit my parallelisation use case. – neelesh bodgal Jun 10 '20 at 12:43
  • Yeah, the TPL Dataflow is an excellent tool for this kind of job, provided that you can afford the small(ish) learning curve. You can see [here](https://stackoverflow.com/questions/58151529/reading-millions-of-small-files-with-c-sharp) an example of using this library, although not for exactly what you are trying to do. – Theodor Zoulias Jun 10 '20 at 12:51
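
The producer-consumer arrangement suggested in the comments could look roughly like the sketch below, which uses `BlockingCollection<string>` rather than TPL Dataflow. The class `LogPipeline`, the method `CountMatches`, the `isMatch` predicate and the capacity of 10000 are all assumptions made for illustration, not something from the thread. One task reads the file with `File.ReadLines` while several worker tasks process lines from a bounded queue.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class LogPipeline
{
    // Hypothetical helper: counts the lines for which isMatch returns true.
    public static long CountMatches(string path, Func<string, bool> isMatch, int workerCount)
    {
        long matches = 0;

        // Bounded, so the reader cannot run arbitrarily far ahead of the workers.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 10000))
        {
            // Producer: a single task reads the file sequentially.
            Task producer = Task.Run(() =>
            {
                try
                {
                    foreach (string line in File.ReadLines(path))
                    {
                        queue.Add(line);
                    }
                }
                finally
                {
                    // Signals the consumers to stop, even if reading failed.
                    queue.CompleteAdding();
                }
            });

            // Consumers: each task drains the queue until adding is completed.
            Task[] consumers = Enumerable.Range(0, workerCount)
                .Select(_ => Task.Run(() =>
                {
                    long local = 0;
                    foreach (string line in queue.GetConsumingEnumerable())
                    {
                        if (isMatch(line)) local++;
                    }
                    // Merge the per-task count once, instead of per line.
                    Interlocked.Add(ref matches, local);
                }))
                .ToArray();

            Task.WaitAll(consumers);
            producer.Wait(); // rethrows any exception from the reader
        }

        return matches;
    }
}
```

Error handling here is minimal, and handing over one line at a time adds synchronization cost per item. Queueing batches of lines may reduce that cost, but whether it helps is something to measure on the actual logs.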

0 Answers