4

I am developing a system that processes sequential files generated by COBOL systems. Currently I extract the data with a long series of Substring calls, but I wonder if there is a more efficient way to process the file than making all of those substrings...

Right now, I basically do:

using (var sr = new StreamReader("file.txt"))
{
    String line = "";
    while(!sr.EndOfStream)
    {
        line = sr.ReadLine();
        switch(line[0])
        {
            case '0':
                processType0(line);
                break;
            case '1':
                processType1(line);
                break;
            case '2':
                processType2(line);
                break;
            case '9':
                processType9(line);
                break;
        }
    }
}

private void processType0(string line)
{
    type = line.Substring(0, 15);
    name = line.Substring(15, 30);
    //... and more 20 substrings
}

private void processType1(string line)
{
    // 45 substrings...
}

The file size may vary between 50 MB and 150 MB... A small example of the file:

01ARQUIVO01CIVDSUQK       00000000000000999999NAME NAME NAME NAME           892DATAFILE       200616        KY0000853                                                                                                                                                                                                                                                                                     000001
1000000000000000000000000999904202589ESMSS59365        00000010000000000000026171900000000002            0  01000000000001071600000099740150000000001N020516000000000000000000000000000000000000000000000000000000000000009800000000000000909999-AAAAAAAAAAAAAAAAAAAAAAAAA                                                            00000000                                                            000002
1000000000000000000000000861504202589ENJNS63198        00000010000000000000036171300000000002            0  01000000000001071600000081362920000000001N020516000000000000000000000000000000000000000000000000000000000000009800000000000000909999-BBBBBBBBBBBBBBBBBBBBBBBBBB                                                           00000000                                                            000003
9                                                                                                                                                                                                                                                                                                                                                                                                         000004
MC Emperor
Alexandre
  • 5
    Efficient? As in the code runs faster? Or the actual process of writing the code is more efficient? – Kaizen Programmer Jun 20 '16 at 14:31
  • 2
    Haven't tried this myself, but try this http://stackoverflow.com/a/20803/1105235 – rpeshkov Jun 20 '16 at 14:33
  • A regular expression will be a *lot* faster than manual splitting because it doesn't generate any temporary strings until you actually extract the matches you want. This is a huge benefit when parsing large files because it reduces allocations and garbage collections dramatically. You can also assign names to specific groups, eg `(?<type>.{15})(?<name>.{14})` etc. (a sketch of this appears just after these comments) – Panagiotis Kanavos Jun 20 '16 at 14:36
  • As I can see, the file contains a good amount of spacing. Why don't you split a line by space, like line.Split(' ')? That will give you an array of substrings which you can process easily. The approach you are using now can't be used for arbitrary string sizes. – Md. Tazbir Ur Rahman Bhuiyan Jun 20 '16 at 14:40
  • 2
    @TazbirBhuiyan any string manipulation generates unnecessary temporary strings. Besides, in fixed-width formats whitespace *is* significant – Panagiotis Kanavos Jun 20 '16 at 14:41
  • Thanks for all! I found this article, http://www.codeproject.com/Articles/10750/Fast-Binary-File-Reading-with-C, but it is all about binary files... @rpeshkov I tried that method, but it didn't work for a text stream; it only works for binary files and I got an exception... – Alexandre Jun 20 '16 at 14:57
  • 2
    @Alexandre are you looking for efficient code performance, or efficient code writing process here? Or both? – Kaizen Programmer Jun 20 '16 at 15:30
  • @Michael_B, I'm looking for efficient code performance! :) – Alexandre Jun 20 '16 at 15:59
  • 2
    Your records look to be fixed-length. Presumably C# has some type of "structure" which maps data? Search-engine seems to think so. – Bill Woodger Jun 20 '16 at 16:03
  • How big an issue is your problem? I found an old 276 MB mailbox, ran a little awk on it doing 276 million one-byte substrings, and it completed in 193 seconds on an ageing mobile i7 (note that awk is doing substring processing behind the scenes as well). If you had 40 fields on each record you're only at about 15 million language-level substrings for your largest file. Is it too slow for you? What are your timings, and what do you need to get them to? – Bill Woodger Jun 22 '16 at 21:21
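
A quick sketch of the regex idea from the comments; the group names and widths below are illustrative only, not the real record layout:

using System.Text.RegularExpressions;

// Compiled pattern with named groups; adjust the names/widths to the actual layout.
static readonly Regex Type0Pattern = new Regex(
    "^(?<type>.{15})(?<name>.{30})(?<rest>.*)$",
    RegexOptions.Compiled);

static void ProcessType0(string line)
{
    Match m = Type0Pattern.Match(line);
    if (!m.Success) return;

    string type = m.Groups["type"].Value;   // the substring is only created here,
    string name = m.Groups["name"].Value;   // when the group's value is read
    // ...
}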

4 Answers

2

Frequent disk reads will slow down your code.

According to MSDN, the buffer size for the constructor you are using is 1024 bytes. Set a larger buffer size using a different constructor:

// StreamReader(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks, int bufferSize)
int bufferSize = 1024 * 128;

using (var reader = new StreamReader(path, encoding, autoDetectEncoding, bufferSize))
{
    // ...
}

Strings in .NET are immutable, so every String function, Substring included, allocates a new string.

Do you really need all of those substrings? If not, then just generate the ones you need:

private static string GetType(string line)
{
    return line.Substring(0, 15);
}

if (needed)
    type = GetType(line);
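
Taking this a step further, a thin wrapper that exposes each field as a property defers every Substring until the field is actually read. A minimal sketch, with an illustrative class name and field layout:

// Hypothetical wrapper for a type-0 record: keep the raw line and
// cut a field out only when its property is accessed.
class Type0Record
{
    private readonly string _line;

    public Type0Record(string line)
    {
        _line = line;
    }

    public string Type => _line.Substring(0, 15);
    public string Name => _line.Substring(15, 30);
    // ... remaining fields follow the same pattern
}
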
  • In my experience this usually makes very little difference as the disk subsystem is normally fairly well buffered before the data even gets to the stream reader code. But it is certainly worth a try. – Martin Brown Jun 21 '16 at 08:34
1

You could try writing a parser which processes the file one character at a time.

The other day I read a good article titled 'Writing a parser for CSV data' on how to do this with CSV files; the principles are the same for most file types. It can be found here: http://www.boyet.com/articles/csvparser.html
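
The linked article reads CSV one character at a time; for fixed-width records the analogous idea is a single left-to-right pass over the line driven by a table of field widths instead of hard-coded offsets. A minimal sketch (the widths in the usage line are examples only):

// Cut a fixed-width line into fields in one pass over a width table.
static string[] SplitFixedWidth(string line, int[] widths)
{
    var fields = new string[widths.Length];
    int pos = 0;
    for (int i = 0; i < widths.Length; i++)
    {
        fields[i] = line.Substring(pos, widths[i]);
        pos += widths[i];
    }
    return fields;
}

// Usage: var fields = SplitFixedWidth(line, new[] { 15, 30, 10 /* ... */ });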

Martin Brown
  • 4
    The fields are fixed starting positions, and fixed width. Where does parsing of a CSV come into it? – Bill Woodger Jun 20 '16 at 15:57
  • CSV is not the important thing here; it is the use of a parser that is the important thing. The article I referenced just happens to use CSV as an example to demonstrate the principles of parsing. – Martin Brown Jun 20 '16 at 16:21
  • You mean "parse sourcefield namea (length) nameb (length) namec (length) with some optional displacements/offsets? At then end that is the same as has been started out with, just a one-(long)-liner. I'm really not sure where you are going with this, but you have adherents :-) – Bill Woodger Jun 20 '16 at 19:15
  • The alternative being presented is to copy a row from the stream to a new string, copy field 1 to a new string, copy field 2 to a new string, copy field 3 to a new string, etc., then truncate whitespace on field 1 copying to a new string as we go, truncate whitespace on field 2 copying to a new string as we go, truncate whitespace on field 3 copying to a new string as we go. That is a lot of memory allocations and string copy operations. I admit, however, that the performance gain from using a parser-type structure is likely to be small. – Martin Brown Jun 21 '16 at 08:26
1

First time with C#, but I think you want to look at something like this (fixed-size buffers require an unsafe struct):

unsafe struct typeOne {
    public fixed byte recordType[1];
    public fixed byte whatThisFieldIsCalled[10];
    public fixed byte someOtherFieldName[5];
    // ...
}

And then just assign different structs by the line[0] case. Or, knowing next to nada about C#, that could be in the completely wrong ballpark and end up being a poor performer internally.
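
One way such a struct could be filled, assuming ASCII data and a project compiled with unsafe code enabled; the field layout is the illustrative one from the sketch above:

using System;
using System.Text;

static unsafe class RecordMapper
{
    // Copy the leading bytes of the line over the struct; because the fixed
    // buffers mirror the record layout, a single copy fills every field.
    public static typeOne ParseTypeOne(string line)
    {
        var record = new typeOne();
        byte[] bytes = Encoding.ASCII.GetBytes(line);
        int count = Math.Min(bytes.Length, sizeof(typeOne));

        fixed (byte* src = bytes)
        {
            Buffer.MemoryCopy(src, &record, sizeof(typeOne), count);
        }
        return record;
    }
}

Individual fields can then be decoded in an unsafe context with, for example, Encoding.ASCII.GetString(record.whatThisFieldIsCalled, 10).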

Brian Tiffin
0

I love Linq

IEnumerable<string> ReadFile(string path)
{
    using (var reader = new StreamReader(path))
    {
        while (!reader.EndOfStream)
        {
            yield return reader.ReadLine();
        }
    }
}

void DoThing()
{
    // One parsing action per record type, indexed by the leading digit.
    var myMethods = new Action<string>[]
    {
        line =>
        {
            // Process type 0
            var type = line.Substring(0, 15);
            var name = line.Substring(15, 30);
            //... and more 20 substrings
        },
        line =>
        {
            // Process type 1
            var type = line.Substring(0, 15);
            var name = line.Substring(15, 30);
            //... and more 20 substrings
        },
        //...
    };

    var actions = ReadFile(@"c:\path\to\file.txt")
        .Select(line => new Action(() => myMethods[line[0] - '0'](line)))
        .ToArray();

    Array.ForEach(actions, a => a());
}
fahadash
  • This won't improve performance at all. In any case, [File.ReadLines](https://msdn.microsoft.com/en-us/library/system.io.file.readlines(v=vs.110).aspx) does the same as ReadFile here – Panagiotis Kanavos Jun 20 '16 at 15:24
  • 1
    @PanagiotisKanavos Is the asker looking for the algo-optimization only? – fahadash Jun 20 '16 at 15:28
  • When reading 150MB of data, the question isn't about algorithms at all, it's about speed and memory. BTW an array/dictionary of Regex objects indexed by the first character would really help. In any case though, the OP is asking about COBOL file parsing. I suspect there is at least one library for this – Panagiotis Kanavos Jun 20 '16 at 15:33
  • @PanagiotisKanavos I agree with you. If you think this answer adds any value or might be beneficial for future readers without confusing them, I will keep it. Otherwise I am willing to delete it. Your call – fahadash Jun 20 '16 at 16:34