0

I would like to create a functionality that works similar to SqlDataReader.Read()

I'm reading a flat-file from .txt/.csv and returning it as a datatable to my class handling business logic. This iterates through the rows of the datatable, and transforms the data, writing into a structured database. I use this structure for multiple import sources.

Large files though, work really, really slowly. It is taking me 2h to go through 30 MB of data, and I would like to get this down to 30 min. One step in this direction is to not read the entire file into a DataTable, but handle it line by line, and keep memory from getting klogged.

Something like this would be ideal: PSEUDOCODE.

FlatFileReader ffr = new FlatFileReader(); //Set FlatFileParameters
while(ffr.ReadRow(out DataTable parsedFlatFileRow))
{
     //...Business Logic for handling the parsedFlatFileRow
}

How can I implement a method that works like .ReadRow(out DataTable parsedFlatFileRow) ?


Is this the right direction?

foreach(obj in ff.lazyreading()){
    //Business Logic
} 

...

class FlatFileWrapper{

    public IEnumerable<obj> lazyreading(){
        while(FileReader.ReadLine()){ 
            yield return parsedFileLine; 
        }
    } 
}
Rui Jarimba
  • 11,166
  • 11
  • 56
  • 86
Rafael Emshoff
  • 2,541
  • 1
  • 35
  • 47
  • 1
    FileHelpers may be a good option for you: http://filehelpers.sourceforge.net/ – Rui Jarimba Oct 30 '13 at 09:44
  • 3
    You probably have not determined the cause for the bad performance yet. You gave guessed that it is memory usage, but I'm very suspicious of that. Profile the app, or pause the debugger 10 times to see where it stops most often. – usr Oct 30 '13 at 09:51
  • No you're right - I didn't profile it yet. But memory usage is a known problem, on this SQL Server, and particularly has been noticed in connection with working through large files. So keeping memory use low is a priority in itself. – Rafael Emshoff Oct 30 '13 at 09:55

2 Answers2

1

As Tim already mentioned, File.ReadLines is what you need:

"When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned"

You can create a parser that uses that method, something like this:

// object you want to create from the file lines.
public class Foo
{
    // add properties here....
}

// Parser only responsibility is create the objects.
public class FooParser
{
    public IEnumerable<Foo> ParseFile(string filename)
    {
        if(!File.Exists(filename))
            throw new FileNotFoundException("Could not find file to parse", filename);

        foreach(string line in File.ReadLines(filename))
        {
            Foo foo = CreateFoo(line);

            yield return foo;
        }
    }

    private Foo CreateFoo(string line)
    {
        // parse line/create instance of Foo here

        return new Foo {
            // ......
        };
    }
}

Using the code:

var parser = new FooParser();

foreach (Foo foo in parser.ParseFile(filename))
{
     //...Business Logic for handling the parsedFlatFileRow
}
Rui Jarimba
  • 11,166
  • 11
  • 56
  • 86
  • thx, I think I understood the concept, will try it out and post results. And this allows me to do lazy loading of the lines in the file, with memory only having the data of the current line? – Rafael Emshoff Oct 30 '13 at 10:49
  • Yes, check the link in my answer: **The ReadLines and ReadAllLines methods differ as follows: When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned; when you use ReadAllLines, you must wait for the whole array of strings be returned before you can access the array. Therefore, when you are working with very large files, ReadLines can be more efficient** – Rui Jarimba Oct 30 '13 at 11:02
0

You can use File.ReadLines which works similar to a StreamReader:

foreach(string line in File.ReadLines(path))
{
     //...Business Logic for handling the parsedFlatFileRow
}
Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
  • I'd like to take the file-reading code into its own class. So I'm looking for a wrapper to the code you posted, which returns a Row from the File, and the next time the method is called, returns the next Row, until the entire file has been read... without loading the entire File into memory. – Rafael Emshoff Oct 30 '13 at 09:47
  • @RafaelCichocki: `File.ReadLines` is what you need. From MSDN: **When you use ReadLines, you can start enumerating the collection of strings before the whole collection is returned**. See http://msdn.microsoft.com/en-us/library/dd383503.aspx – Rui Jarimba Oct 30 '13 at 09:51
  • 2
    @RafaelCichocki: You could wrap the importer in a class that wraps a `StreamReader` (remember to implement `IDisposable` to dispose it) and `yield return` to return your objects lazily. http://stackoverflow.com/a/286553/284240 – Tim Schmelter Oct 30 '13 at 09:57
  • @RafaelCichocki This is indeed the method you need. But you probably want to look a C# iterators, i.e. using the `yield return` keyword. – Kris Vandermotten Oct 30 '13 at 10:01
  • 1
    @TimSchmelter: maybe you can edit your answer to include a full method instead of just a foreach loop, using the `yield return` keyword? I think it would be much easier for the OP to understand – Rui Jarimba Oct 30 '13 at 10:09
  • @RuiJarimba: I would, but i don't know his business logic. He wants to create a class that wraps all this functionality and yields instances of a type lazily like a `SqlDataReader`. However, that is diffcult to show without more context. – Tim Schmelter Oct 30 '13 at 10:16
  • @TimSchmelter: that's a bit confusing for me : P. I started looking into `yield return` but I understand it can be used to fill a return IEnumberable object using an iterator. The method still returns 1 object though... if you could post some pseudo code to show the wrapping of StreamReader and lazy reading of the `yield return` that would help : ) ... I think I need something like `foreach( obj in ff.lazyreading()){ } public IEnumerable lazyreading(){ while(FileReader.ReadLine()){ yield return parsedFileLine }}` ?? (I posted this at the bottom of my question) – Rafael Emshoff Oct 30 '13 at 10:17