I have a really big CSV file with about 1,000,000 rows that takes about 500 MB of memory. I don't need to read the whole file; I only want to read every hundredth line. I tried to do it with ReadLines, but it is really slow; ReadAllLines is faster.

My code:

for (int i = 0; i < 10000; i++)
{
    tableOfString[i] = File.ReadLines("TestCSV.csv").Skip(i * 100).Take(1).First();
    // or
    tableOfString[i] = File.ReadLines("TestCSV.csv").ElementAtOrDefault(i * 100);
}

I have read about some CSV readers:

Has anybody got a solution? I want to read only certain lines from the CSV, not the whole file.

marc_s
  • Store all lines in a string[] instead of using `File.ReadLines`; you can use `File.ReadAllLines`. `File.ReadLines` returns a lazily evaluated `IEnumerable<string>`, so every time you enumerate it, the file is read again. – Tim Schmelter Oct 28 '14 at 09:28
  • Are you saying you don't want to read the whole file into memory? – DavidG Oct 28 '14 at 09:29
  • `File.ReadLines("TestCSV.csv").ElementAtOrDefault(i * 100);` – artm Oct 28 '14 at 09:35

3 Answers


ReadLines is not slow. The problem is that you're re-reading the file up to the desired row on each iteration: when i=1, you read lines 0–100; when i=2, you read lines 0–200; and so on.

You should avoid calling File.ReadLines multiple times. In other words, open the file only once and filter out the lines you don't want using Where. So try this instead:

var filteredLines = File.ReadLines("TestCSV.csv")
    .Select((Text, Index) => new { Text, Index })
    .Where(x => x.Index % 100 == 0);

foreach (var line in filteredLines)
{
    // Index is the line number in the file, so divide by 100 to get the array slot
    tableOfString[line.Index / 100] = line.Text;
}

Not sure how you're creating or using that tableOfString, but if it is used solely to hold these lines, you can convert the LINQ query directly to an array (you don't have to populate the array in a for-loop):

var tableOfString = File.ReadLines("TestCSV.csv")
    .Where((x, i) => i % 100 == 0)
    .ToArray();
Eren Ersönmez
  • It is a good solution, but it still iterates over every row in the file and takes only every hundredth, which saves RAM. :) Is it possible not to iterate over everything and read only the chosen lines? That would be faster. Right now I have 500,000 lines in the file and take only 1000 lines, but it still takes 2 seconds to read. – user3447900 Oct 28 '14 at 10:41

According to your code you want to get the

0th, 100th, 200th, ..., 1000000th lines of the CSV file and store them in tableOfString[]

You can do it like that:

tableOfString = File
    .ReadLines("TestCSV.csv")
    .Where((line, index) => (index % 100) == 0)
    .ToArray();

Re-opening the file in a loop, as you do, is slow and adds a great deal of overhead.

Dmitry Bychenko

First of all: if you don't want to load the complete file into memory, then File.ReadAllLines won't work, and File.ReadLines, although it streams lazily, still has to read through every line.

If you want to read only a few bytes of the file into RAM, I would recommend using File.OpenRead and then reading the parts you need into a buffer, as in How can I read/stream a file without loading the entire file into memory?.
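A minimal sketch of that chunked approach (the sample file, its size, and the 64 KB buffer are illustrative assumptions, not from the question):

```csharp
// Chunked reading with File.OpenRead: only one buffer-sized piece of the
// file is in memory at a time. Sample file and buffer size are made up.
using System;
using System.IO;

class ChunkedRead
{
    static void Main()
    {
        File.WriteAllBytes("TestCSV.csv", new byte[150_000]); // illustrative sample file

        var buffer = new byte[64 * 1024];
        long total = 0;
        using (var stream = File.OpenRead("TestCSV.csv"))
        {
            int bytesRead;
            while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // scan/parse buffer[0..bytesRead] here
                total += bytesRead;
            }
        }
        Console.WriteLine(total); // prints 150000
    }
}
```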

But then you have the problem that you cannot simply skip 99 lines and read only every 100th line. To implement this, you need to know the byte length of every line so you can compute the offset for the Read method.
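If the lines really do have a known, fixed byte length, the offset idea can be sketched like this (the generated sample file, the 10-byte line length, and the step of 100 are all illustrative assumptions):

```csharp
// Sketch: seek directly to every 100th line when every line has the same
// fixed byte length. The sample file and its 10-byte lines are made up.
using System;
using System.IO;
using System.Linq;
using System.Text;

class SeekEveryHundredth
{
    static void Main()
    {
        const int step = 100;
        // Build a sample file of 1000 fixed-length lines: 9 digits + "\n" = 10 bytes each.
        var lines = Enumerable.Range(0, 1000).Select(i => i.ToString("D9"));
        File.WriteAllText("TestCSV.csv", string.Join("\n", lines) + "\n");

        const int lineLength = 10;                    // known fixed bytes per line
        using var stream = File.OpenRead("TestCSV.csv");
        using var reader = new StreamReader(stream, Encoding.ASCII);
        long lineCount = stream.Length / lineLength;  // 1000
        for (long i = 0; i * step < lineCount; i++)
        {
            stream.Position = i * step * lineLength;  // jump straight to the target line
            reader.DiscardBufferedData();             // resync the reader after seeking
            Console.WriteLine(reader.ReadLine());     // prints 000000000, 000000100, ...
        }
    }
}
```

Only 10 short reads touch the disk instead of a full scan, which is why this approach avoids the 2-second sequential read the asker complained about, but it falls apart as soon as line lengths vary.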

The easiest version is to work with File.ReadAllLines and then iterate over the string array, or use LINQ.

BendEg