6

I am trying to read in a table from an Access database and then sort the data in that table out into a number of text files. The clincher is that the filename to write to is dependent upon values in each record. This is officially my first C# application, so you can consider me "green". I should also mention that I'm working off an Access database just until I can get the code hammered out; ultimately it will be pulling from a SQL Server with millions of records.

I have code working now, but the problem is that there are a ton of file open/close operations. I want to open each file only once for writing, since these files are written to a network drive. This is essentially a glue app running on a server, so there are some other restrictions too: I can't save to a local drive and then copy to the network, I can't sort the query before pulling it, and I can't adversely affect server resources while running.

Probably the best way to do this is with a hash table: check whether the file has already been opened; if not, open it and save the file handle in the hash table, then close them all at once when finished. However, I cannot find an example of how to use multiple StreamWriter objects simultaneously.
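In rough, untested pseudocode, the pattern I have in mind looks something like this (it reuses the variables and the GetDeviceFileName helper from my code below); I just don't know whether holding this many StreamWriter objects open at once is safe:

// Rough, untested sketch of the idea - one StreamWriter per output file, kept open until the end
Dictionary<string, StreamWriter> writers = new Dictionary<string, StreamWriter>();
try
{
    // ... inside the reader loop, after dtTime, intID and strReader have been pulled out ...
    String pathStr = strPunchFileNameTemplate.Replace("%READER%", strReader);

    StreamWriter outSR;
    if (!writers.TryGetValue(pathStr, out outSR))
    {
        Boolean fileExistedBool = File.Exists(pathStr);
        outSR = new StreamWriter(pathStr, true);    // open once, append mode
        if (fileExistedBool == false)
            outSR.WriteLine("EXAMPLE FILE HEADER");
        writers.Add(pathStr, outSR);
    }

    outSR.WriteLine(dtTime.ToString("MM-dd-yyyy HH:mm:ss") + " " + intID.ToString("000000"));
}
finally
{
    // Close everything once, after the whole table has been processed
    foreach (StreamWriter writerObj in writers.Values)
        writerObj.Dispose();
}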

I expected to find the answer to this relatively easily, but I can't seem to find a solution to this. My suspicion is that StreamWriter is the wrong class to be using for this.

The closest previous question I've been able to find is from a CodeProject page. On that page they say that the practice of keeping file handles open is bad and should be avoided, but the page doesn't explain why, nor does it offer example alternatives. There is a suggestion to load the entire data set into memory and then operate on it, but that's not an option for me, as there will be too much data in the tables.

Here's what I have so far.

String strConnection;
String strQuery;
String strPunchFileNameTemplate;

// Define our Variables
strConnection = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=ClockData.accdb";
strQuery = @"SELECT * FROM ClockPunches";   
strPunchFileNameTemplate = @"C:\PUNCHES\%READER%.TXT";      

// OleDbConnection implements the IDisposable interface, so we must scope out its usage.
// Set up Connection to our data source
using (OleDbConnection ConnObj = new OleDbConnection(strConnection))    {

    // Create a Command with our Query String
    OleDbCommand CmdObj = new OleDbCommand(strQuery,ConnObj);

    // Open our Connection
    ConnObj.Open();

// OleDbDataReader implements the IDisposable interface, so we must scope out its usage.
    // Execute our Reader
    using (OleDbDataReader ReaderObj = CmdObj.ExecuteReader(CommandBehavior.KeyInfo))   {

        // Load the source table's schema into memory (a DataTable object)
        DataTable TableObj = ReaderObj.GetSchemaTable();

        // Parse through each record in the Reader Object
        while(ReaderObj.Read()) {

            // Extract PunchTime, CardNumber, and Device to separate variables
            DateTime dtTime = ReaderObj.GetDateTime(ReaderObj.GetOrdinal("PunchTime"));
            Int16 intID = ReaderObj.GetInt16(ReaderObj.GetOrdinal("CardNumber"));
            String strReader = ReaderObj.GetString(ReaderObj.GetOrdinal("Device"));

            // Translate the device name into a designated filename (external function)
            strReader = GetDeviceFileName(strReader);

            // Put our dynamic filename into the path template
            String pathStr = strPunchFileNameTemplate.Replace("%READER%",strReader);

            // Check to see if the file exists.  New files need an import Header
            Boolean FileExistedBool = File.Exists(pathStr);

            // StreamWriter implements the IDisposable interface, so we must scope out its usage.
            // Create a Text File for each Device, Append if it exists
            using (StreamWriter outSR = new StreamWriter(pathStr, true))    {

                // Write our Header if required
                if (FileExistedBool == false)   {
                    outSR.WriteLine("EXAMPLE FILE HEADER");
                }

                // Set up our string we wish to write to the file
                String outputStr = dtTime.ToString("MM-dd-yyyy HH:mm:ss") + " " + intID.ToString("000000");

                // Write the String
                outSR.WriteLine(outputStr);

                // End of StreamWriter Scope - should automatically close
            }
        }
        // End of OleDbDataReader Scope - should automatically close
    }
    // End of OleDbConnection Scope - should automatically close
}
Daniel Ward
  • What is your primary / unique key? Can you append that key into your text filename so that it will not cause any contention issues, as it has the unique key appended to the filename? – PM_ME_YOUR_CODE Mar 19 '13 at 20:27
  • The key is already in the filename. There is only one file generated per reader device. – Daniel Ward Mar 19 '13 at 21:20

4 Answers

2

It is quite an interesting problem you have got yourself into.

The problem with caching the file handles is that a huge number of open file handles can drain the system's resources, making the program and Windows perform badly.

If the number of devices in your database is not too high (less than 100) I think it would be safe to cache the handles.

Alternatively, you could cache a million records, distribute them among the different devices, write some of them out, and then read some more records.

You could place the records in a Dictionary like this:

class PunchInfo
{  
    public PunchInfo(DateTime time, int id)
    {
        Id = id;
        Time = time;
    }
    public DateTime Time;
    public int Id;
}

// Buffered records, grouped by device name
Dictionary<string, List<PunchInfo>> Devices = new Dictionary<string, List<PunchInfo>>();
int Count = 0;
const int Limit = 1000000;               // flush once a million records are buffered
const int LowerLimit = 90 * Limit / 100; // flush back down to 90% of the limit

// Buffer one record; once the limit is reached, write out whole devices
// until the buffer is back under the lower limit.
void SaveRecord(string device, int id, DateTime time)
{
   PunchInfo info = new PunchInfo(time, id);
   List<PunchInfo> list;
   if (!Devices.TryGetValue(device, out list))
   {
      list = new List<PunchInfo>();
      Devices.Add(device, list);
   }
   list.Add(info);
   Count++;
   if (Count >= Limit)
   {
       // Pick the devices to flush (we cannot remove entries while enumerating the dictionary)
       List<string> writeDevices = new List<string>();
       foreach(KeyValuePair<string, List<PunchInfo>> item in Devices)
       {
           writeDevices.Add(item.Key);
           Count -= item.Value.Count;
           if (Count < LowerLimit) break;
       }

       // Write and drop the selected devices
       foreach(string dev in writeDevices)
       {
          List<PunchInfo> devList = Devices[dev];
          Devices.Remove(dev);
          SaveDevices(dev, devList);
       }
    }
}

// Write out everything still buffered at the end of the run
void SaveAllDevices()
{
    foreach(KeyValuePair<string, List<PunchInfo>> item in Devices)
        SaveDevices(item.Key, item.Value);
    Devices.Clear();
}

This way you avoid repeatedly opening and closing files, and you avoid having a lot of files open at once.

One million records take up about 20 MB of memory; you could easily raise that to 10 million records without problems.
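SaveDevices is left to the reader here; a minimal sketch, assuming the same %READER% path template and header handling as in the question, could look like this:

// Hypothetical SaveDevices: append all buffered punches for one device to its file,
// writing the header first if the file did not exist yet.
void SaveDevices(string device, List<PunchInfo> list)
{
    String pathStr = strPunchFileNameTemplate.Replace("%READER%", device);
    bool fileExisted = File.Exists(pathStr);
    using (StreamWriter outSR = new StreamWriter(pathStr, true))    // append mode
    {
        if (!fileExisted)
            outSR.WriteLine("EXAMPLE FILE HEADER");
        foreach (PunchInfo info in list)
            outSR.WriteLine(info.Time.ToString("MM-dd-yyyy HH:mm:ss") + " " + info.Id.ToString("000000"));
    }
}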

Casperah
  • "huge amount of file handlers can drain the system". that is the rule we learn, but measure it - this rule has gone. or other way: "huge" is now very large, like n * 10000. – citykid Mar 19 '13 at 20:43
  • Even though Windows is able to handle 10,000 open files, it is still a bad idea if there are other ways to perform the same task. – Casperah Mar 19 '13 at 20:48
  • The question is about the best alternative. I do not say it is possible, I say it is fast and has no side effects. What is your concrete reason to consider it a "bad idea"? If possible avoid it, sure, but sometimes it is the best choice. – citykid Mar 19 '13 at 20:52
  • If a program has a huge number of files open for writing and the power suddenly goes, then you get a hard drive with a huge number of integrity errors. – Casperah Mar 19 '13 at 21:15
  • This is how I would approach the problem as well. Using a dictionary allows you to both buffer the data and group by device. Once the buffer reaches maximum capacity, it is easy to open a StreamWriter for each key (device) in the dictionary and write all the associated records to the target file. – mbreske Mar 19 '13 at 21:30
  • For this specific task (Devices) there are only 43 devices and if those change in the next 10 years, it'll only be by one or two. However, I am going to have to do the same task but breaking the punches out by which company the employee is working for .. there are about 800 companies currently .. so that sounds like it would be too many open file handles even for the way I want to do it. – Daniel Ward Mar 19 '13 at 21:31
  • However .. (and I could be wrong on this) .. but if I recall, when a file is opened on a remote server, then closed - the file handle stays active for a time afterward. If that's true, then opening / re-opening the file may impact server resources even quicker. – Daniel Ward Mar 19 '13 at 21:33
  • I think your solution of buffering it is the best @Casperah / Steve. I'm new to C# so I'll have to digest your example. I think I'll want to refactor the design a bit to make it more generic so it can split a single record into multiple files to satisfy multiple export requirements simultaneously. – Daniel Ward Mar 19 '13 at 21:41
  • I'm going to mark this as the accepted answer. Reading in the records, sorting by filename, and then writing each filename looks like the most appropriate solution. In the end, I've handed the task off to someone else - it was decided that a more streamlined approach to the problem was to use SQL Server Integration Services. It sounds like I've got the groundwork laid - but SSIS is outside my scope. Thank you everyone. – Daniel Ward Mar 21 '13 at 20:32
  • Thank you for accepting the answer, I think you (your company) have made the right decision to store the data in a database. Databases are perfect for that job. Great work and good luck with your future projects. – Casperah Mar 22 '13 at 07:20
1

You need to set up an array of writers. This is an example of how to do it.

namespace example
{
    class Program
    {
        public static StreamWriter[] writer = new StreamWriter[3];

        static void Main(string[] args)
        {
            writer[0] = new StreamWriter("YourFile1.txt");
            writer[1] = new StreamWriter("YourFile2.txt");
            writer[2] = new StreamWriter("YourFile3.txt");

            writer[0].WriteLine("Line in YourFile1.");
            writer[1].WriteLine("Line in YourFile2.");
            writer[2].WriteLine("Line in YourFile3.");

            writer[0].Close();
            writer[1].Close();
            writer[2].Close();
        }
    }
}
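One caveat: if an exception is thrown before the Close calls, the handles stay open. A rough variant that guarantees cleanup (the file names are placeholders) might look like this:

// Sketch: same three writers, but disposed even if an exception occurs.
StreamWriter[] writer = new StreamWriter[3];
try
{
    writer[0] = new StreamWriter("YourFile1.txt");
    writer[1] = new StreamWriter("YourFile2.txt");
    writer[2] = new StreamWriter("YourFile3.txt");

    writer[0].WriteLine("Line in YourFile1.");
    writer[1].WriteLine("Line in YourFile2.");
    writer[2].WriteLine("Line in YourFile3.");
}
finally
{
    // Dispose whatever was successfully opened.
    foreach (StreamWriter w in writer)
        if (w != null) w.Dispose();
}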

Mario
0

I suggest keeping your data in memory and writing to disk only when you reach a certain threshold.

const int MAX_MEMORY_BUFFER = 100000; // To be defined according to your memory limits
String strConnection;
String strQuery;
String strPunchFileNameTemplate;

strConnection = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=ClockData.accdb";
strQuery = @"SELECT * FROM ClockPunches";   
strPunchFileNameTemplate = @"C:\PUNCHES\%READER%.TXT";      

Dictionary<string, StringBuilder> data = new Dictionary<string, StringBuilder>();

using (OleDbConnection ConnObj = new OleDbConnection(strConnection))    
{
    OleDbCommand CmdObj = new OleDbCommand(strQuery,ConnObj);
    ConnObj.Open();

    using (OleDbDataReader ReaderObj = CmdObj.ExecuteReader(CommandBehavior.KeyInfo))   
    {
        while(ReaderObj.Read()) 
        {
            DateTime dtTime = ReaderObj.GetDateTime(ReaderObj.GetOrdinal("PunchTime"));
            Int16 intID = ReaderObj.GetInt16(ReaderObj.GetOrdinal("CardNumber"));
            String strReader = ReaderObj.GetString(ReaderObj.GetOrdinal("Device"));

            strReader = GetDeviceFileName(strReader);

            bool dataPresent = data.ContainsKey(strReader);
            if (dataPresent == false)   
            {
                StringBuilder sb = new StringBuilder("EXAMPLE FILE HEADER\r\n");
                data.Add(strReader, sb);
            }

            String outputStr = dtTime.ToString("MM-dd-yyyy HH:mm:ss") + " " + intID.ToString("000000");
            StringBuilder sb = data[strReader];
            sb.AppendLine(outputStr);
            if(sb.Length > MAX_MEMORY_BUFFER)
            {
                String pathStr = strPunchFileNameTemplate.Replace("%READER%",strReader);
                using(StreamWriter sw = new StreamWriter(pathStr, true)) // Append mode
                {
                    // Write the buffer (it already ends in a newline) and set the length to zero
                    sw.Write(sb.ToString());
                    sb.Length = 0;
                }
            }
        }
    }

    // Write all the data remaining in memory
    foreach(KeyValuePair<string, StringBuilder> info in data)
    {
        if(info.Value.Length > 0)
        {
          String pathStr = strPunchFileNameTemplate.Replace("%READER%",info.Key);
          using(StreamWriter sw = new StreamWriter(pathStr, true)) // Append mode
          {
              sw.Write(info.Value.ToString());
          }
        }
    }
}

This code needs to be tested, but I hope it gives you the general idea. In this way you can balance your IO operations: fewer writes with a larger memory buffer, and vice versa. Of course, you now also need to consider the memory available to store your data.

Steve
  • @M4N you were right, but this question intrigued me and so I have changed the approach at the problem. – Steve Mar 19 '13 at 21:00
  • I'm pulling the data from the SQL server that the vendor is managing - this particular server is a controls system integrator - so if we negatively impact the server resources, we could be looking at issues with HVAC / Lighting / Security. I've been told I should only use forward-only cursors on the view they've made available to me. But in light of what you guys have told me about the file handles .. I think NOT sorting the query is going to have more of an impact on their server than sorting it. I get the feeling they are being too protective... but what do I know :) – Daniel Ward Mar 19 '13 at 21:47
  • Well, a forward only cursor is the lightest way to access their data, of course they require the files, so they should accept you writing there. My solution is a tradeoff that will allow you to balance how many IO operations you will do against their precious hardware. Can you tell us how big these files will be on the production server? – Steve Mar 19 '13 at 21:50
  • They are text files with limited fields so they won't be big at all... like a few hundred kilobytes. The Vendor doesn't require the files - our company does .. it was like getting teeth pulled just to get access to the SQL view so I doubt I could get local file access. A sort however is a pretty modest request IMHO – Daniel Ward Mar 19 '13 at 22:04
0

If a single process has 100 or 1,000 file handles open over a longer period, this is generally considered questionable. But times have changed, and this is no longer an issue. So if the situation calls for it, do it.

I was in the position of keeping 100, 1,000 or even 5,000 files open in a process that analyzes the data in those files, and this would last for hours. I measured on a Windows OS whether file reading/writing performance would go down, and this is not at all the case. Due to the memory resources now available on modern machines, having 5,000 file descriptors in memory on the OS side does not cause any issues anymore. The OS keeps them sorted (I guess) and the lookup of these descriptors is then log(n), so nothing measurable happens.

Having these handles (file descriptor structs) open is certainly far better than filling memory up with data and then flushing it to disk, file by file.
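If you want to verify this on your own machine, a rough sketch of such a measurement (the file count, temp directory and line content are placeholders, not the measurement described above) could be:

// Rough benchmark sketch: keep N StreamWriters open and time interleaved appends.
using System;
using System.Diagnostics;
using System.IO;

class OpenHandleBenchmark
{
    static void Main()
    {
        const int fileCount = 1000;                      // adjust to taste
        string dir = Path.Combine(Path.GetTempPath(), "handle_test");
        Directory.CreateDirectory(dir);

        StreamWriter[] writers = new StreamWriter[fileCount];
        Stopwatch sw = Stopwatch.StartNew();
        try
        {
            for (int i = 0; i < fileCount; i++)
                writers[i] = new StreamWriter(Path.Combine(dir, "file" + i + ".txt"), true);

            // Interleave writes across all open handles, as the punch export would.
            for (int line = 0; line < 100; line++)
                for (int i = 0; i < fileCount; i++)
                    writers[i].WriteLine("03-19-2013 20:27:00 000042");
        }
        finally
        {
            foreach (StreamWriter w in writers)
                if (w != null) w.Dispose();
        }
        Console.WriteLine("Elapsed: " + sw.Elapsed);
    }
}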

citykid